Title
Single-partition adaptive Q-learning
Authors
Abstract
This paper introduces single-partition adaptive Q-learning (SPAQL), an algorithm for model-free episodic reinforcement learning (RL), which adaptively partitions the state-action space of a Markov decision process (MDP) while simultaneously learning a time-invariant policy (i.e., the mapping from states to actions does not depend explicitly on the episode time step) for maximizing the cumulative reward. The trade-off between exploration and exploitation is handled by using a mixture of upper confidence bounds (UCB) and Boltzmann exploration during training, with a temperature parameter that is automatically tuned as training progresses. The algorithm is an improvement over adaptive Q-learning (AQL): it converges faster to the optimal solution, while also using fewer arms. Tests on episodes with a large number of time steps show that SPAQL scales without difficulty, unlike AQL. Based on this empirical evidence, we claim that SPAQL may have a higher sample efficiency than AQL, and is thus a relevant contribution to the field of efficient model-free RL methods.
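To make the exploration scheme concrete, the following is a minimal sketch of a mixed UCB/Boltzmann action-selection rule of the kind the abstract describes. It is illustrative only and not the exact SPAQL rule: the function name, the specific UCB bonus, and the way the bonus is folded into the softmax are assumptions for the sake of the example.

```python
import math
import random

def select_action(q_values, visit_counts, total_visits, temperature):
    """Illustrative sketch: add a UCB-style bonus to each arm's Q-value,
    then sample an arm from the Boltzmann (softmax) distribution over
    the resulting scores. Not the exact SPAQL rule."""
    # UCB-style exploration bonus: less-visited arms get a larger bonus
    scores = [
        q + math.sqrt(2.0 * math.log(max(total_visits, 1)) / max(n, 1))
        for q, n in zip(q_values, visit_counts)
    ]
    # Boltzmann exploration: higher temperature -> closer to uniform
    # (subtracting the max score keeps exp() numerically stable)
    max_s = max(scores)
    weights = [math.exp((s - max_s) / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample an arm index according to probs
    r, cum = random.random(), 0.0
    for arm, p in enumerate(probs):
        cum += p
        if r < cum:
            return arm
    return len(probs) - 1
```

Annealing the `temperature` toward zero as training progresses (as the abstract's automatic tuning does) makes this rule increasingly greedy with respect to the bonus-adjusted Q-values.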