Title
Single-partition adaptive Q-learning
Authors
Abstract
This paper introduces single-partition adaptive Q-learning (SPAQL), an algorithm for model-free episodic reinforcement learning (RL), which adaptively partitions the state-action space of a Markov decision process (MDP) while simultaneously learning a time-invariant policy (i.e., the mapping from states to actions does not depend explicitly on the episode time step) for maximizing the cumulative reward. The trade-off between exploration and exploitation is handled by using a mixture of upper confidence bounds (UCB) and Boltzmann exploration during training, with a temperature parameter that is automatically tuned as training progresses. The algorithm is an improvement over adaptive Q-learning (AQL): it converges faster to the optimal solution, while also using fewer arms. Tests on episodes with a large number of time steps show that SPAQL scales without difficulty, unlike AQL. Based on this empirical evidence, we claim that SPAQL may have a higher sample efficiency than AQL, and is thus a relevant contribution to the field of efficient model-free RL methods.
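To make the exploration scheme concrete, the following is a minimal sketch of a mixed UCB/Boltzmann action-selection rule of the kind the abstract describes. It is illustrative only and not the exact SPAQL rule: the function name, the specific UCB bonus, and the way the bonus is folded into the softmax are assumptions for the sake of the example.

```python
import math
import random

def select_action(q_values, visit_counts, total_visits, temperature):
    """Illustrative sketch: add a UCB-style bonus to each arm's Q-value,
    then sample an arm from the Boltzmann (softmax) distribution over
    the resulting scores. Not the exact SPAQL rule."""
    # UCB-style exploration bonus: less-visited arms get a larger bonus
    scores = [
        q + math.sqrt(2.0 * math.log(max(total_visits, 1)) / max(n, 1))
        for q, n in zip(q_values, visit_counts)
    ]
    # Boltzmann exploration: higher temperature -> closer to uniform
    # (subtracting the max score keeps exp() numerically stable)
    max_s = max(scores)
    weights = [math.exp((s - max_s) / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample an arm index according to probs
    r, cum = random.random(), 0.0
    for arm, p in enumerate(probs):
        cum += p
        if r < cum:
            return arm
    return len(probs) - 1
```

Annealing the `temperature` toward zero as training progresses (as the abstract's automatic tuning does) makes this rule increasingly greedy with respect to the bonus-adjusted Q-values.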