Generalised Policy Improvement with Geometric Policy Composition

Paper title

Generalised Policy Improvement with Geometric Policy Composition

Authors

Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Rémi Munos, André Barreto

Abstract

We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.
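
As a rough illustration of the standard GPI step that the abstract builds on, the sketch below computes exact action values for a few base policies in a small random tabular MDP and then acts greedily with respect to their pointwise maximum. It does not implement GHMs or the geometric composition described in the paper; the MDP, the helper names `q_values` and `gpi_policy`, and all sizes are assumptions chosen for illustration only.

```python
import numpy as np

def q_values(P, r, pi, gamma):
    """Exact Q^pi for a tabular MDP (illustrative helper, not from the paper).

    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    pi: (S,) deterministic policy given as action indices.
    """
    S, _ = r.shape
    P_pi = P[np.arange(S), pi]   # (S, S) transitions under pi
    r_pi = r[np.arange(S), pi]   # (S,) rewards under pi
    # Solve the Bellman equation v = r_pi + gamma * P_pi v.
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v     # (S, A)

def gpi_policy(q_list):
    """Standard GPI: pi'(s) = argmax_a max_i Q^{pi_i}(s, a)."""
    q_max = np.max(np.stack(q_list), axis=0)   # (S, A)
    return np.argmax(q_max, axis=1)

# A small random MDP and three random base policies (all hypothetical).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
base_policies = [rng.integers(A, size=S) for _ in range(3)]

q_list = [q_values(P, r, pi, gamma) for pi in base_policies]
pi_new = gpi_policy(q_list)
q_new = q_values(P, r, pi_new, gamma)
```

With exact action values, the GPI policy is at least as good as every base policy in every state-action pair, so `q_new` should dominate the pointwise maximum of `q_list`. The method in the abstract enlarges the set of policies fed into this step by evaluating non-Markov switching policies through composed GHMs, which this sketch does not attempt.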
