Title

Single-partition adaptive Q-learning

Authors

João Pedro Araújo, Mário Figueiredo, Miguel Ayala Botto

Abstract

This paper introduces single-partition adaptive Q-learning (SPAQL), an algorithm for model-free episodic reinforcement learning (RL), which adaptively partitions the state-action space of a Markov decision process (MDP), while simultaneously learning a time-invariant policy (i.e., the mapping from states to actions does not depend explicitly on the episode time step) for maximizing the cumulative reward. The trade-off between exploration and exploitation is handled by using a mixture of upper confidence bounds (UCB) and Boltzmann exploration during training, with a temperature parameter that is automatically tuned as training progresses. The algorithm is an improvement over adaptive Q-learning (AQL). It converges faster to the optimal solution, while also using fewer arms. Tests on episodes with a large number of time steps show that SPAQL has no problems scaling, unlike AQL. Based on this empirical evidence, we claim that SPAQL may have a higher sample efficiency than AQL, thus being a relevant contribution to the field of efficient model-free RL methods.
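The abstract states that exploration and exploitation are balanced by mixing UCB with Boltzmann exploration under an automatically tuned temperature. The sketch below is only a rough illustration of that general idea, not the authors' algorithm: the function name `select_arm`, the `ucb_weight` parameter, and the particular bonus formula are assumptions, since the abstract does not give SPAQL's exact update rules.

```python
import numpy as np

def select_arm(q_values, counts, total_visits, temperature, ucb_weight=1.0, rng=None):
    """Illustrative arm selection mixing a UCB-style bonus with Boltzmann exploration.

    q_values:     estimated Q-values of the arms covering the current state (array)
    counts:       visit counts per arm (array)
    total_visits: total visits to the containing region
    temperature:  Boltzmann temperature (in SPAQL it would be tuned automatically)
    """
    if rng is None:
        rng = np.random.default_rng()
    # Optimistic bonus that shrinks as an arm is visited more often (UCB-like term;
    # the exact form used by SPAQL is an assumption here).
    bonus = ucb_weight * np.sqrt(np.log(total_visits + 1) / (counts + 1))
    scores = q_values + bonus
    # Boltzmann (softmax) exploration over the optimistic scores;
    # a low temperature makes the choice nearly greedy.
    logits = scores / max(temperature, 1e-8)
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_values), p=probs)
```

In the setting described by the abstract, the temperature would be decreased automatically as training progresses, so sampling like the above becomes increasingly greedy over time; here the caller would simply supply the annealed value.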
