Paper Title

Lookahead-Bounded Q-Learning

Authors

Ibrahim El Shar, Daniel R. Jiang

Abstract


We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning that seeks to improve the performance of standard Q-learning in stochastic environments through the use of "lookahead" upper and lower bounds. To do this, LBQL employs previously collected experience and each iteration's state-action values as dual feasible penalties to construct a sequence of sampled information relaxation problems. The solutions to these problems provide estimated upper and lower bounds on the optimal value, which we track via stochastic approximation. These quantities are then used to constrain the iterates to stay within the bounds at every iteration. Numerical experiments on benchmark problems show that LBQL exhibits faster convergence and more robustness to hyperparameters when compared to standard Q-learning and several related techniques. Our approach is particularly appealing in problems that require expensive simulations or real-world interactions.
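The core mechanism the abstract describes (constraining each Q-learning iterate to lie between lookahead-derived lower and upper bounds) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bound arrays `L` and `U` are assumed to be given, standing in for the estimates LBQL obtains from the sampled information relaxation problems, and the function name and setup are hypothetical.

```python
import numpy as np

def bounded_q_update(Q, L, U, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One bounded Q-learning step: a standard Q-learning update whose
    result is projected onto the interval [L[s, a], U[s, a]].

    L and U are placeholders for LBQL's lookahead lower/upper bound
    estimates (tracked via stochastic approximation in the paper);
    here they are simply supplied by the caller.
    """
    # Standard Q-learning temporal-difference target and update.
    target = r + gamma * np.max(Q[s_next])
    q_new = Q[s, a] + alpha * (target - Q[s, a])
    # Project the iterate so it stays within the bounds.
    Q[s, a] = np.clip(q_new, L[s, a], U[s, a])
    return Q
```

The projection step is what distinguishes this from vanilla Q-learning: even a noisy sample that would push the estimate far outside plausible values is clipped back into the bounded interval, which is the source of the faster, more robust convergence the abstract reports.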
