Paper Title

Latent State Marginalization as a Low-cost Approach for Improving Exploration

Paper Authors

Dinghuai Zhang, Aaron Courville, Yoshua Bengio, Qinqing Zheng, Amy Zhang, Ricky T. Q. Chen

Paper Abstract

While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution, and additionally, naturally emerges under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train, how naive approaches can fail, then subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.
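As a rough illustration of the marginalization idea described in the abstract, the sketch below shows a latent-variable Gaussian policy whose marginal action log-density is estimated by Monte Carlo sampling over the latent state. This is a minimal sketch under an assumed PyTorch setup; the network shapes, the Gaussian prior, and the `marginal_log_prob` helper are illustrative assumptions, not the authors' SMAC implementation (see the linked repository for that).

```python
# Minimal sketch (assumed PyTorch setup), NOT the authors' SMAC code:
# a latent-variable Gaussian policy whose marginal log-density
#   log pi(a|s) = log E_{z ~ p(z|s)} [ pi(a|s, z) ]
# is estimated with K Monte Carlo samples of the latent state.
import math
import torch
import torch.nn as nn


class LatentVariablePolicy(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=64):
        super().__init__()
        # p(z|s): Gaussian prior over the latent state, conditioned on the observation
        self.prior_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * latent_dim)
        )
        # pi(a|s, z): Gaussian action head conditioned on (state, latent)
        self.action_net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * action_dim)
        )

    @staticmethod
    def _gaussian(params):
        mean, log_std = params.chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5.0, 2.0).exp())

    def marginal_log_prob(self, state, action, num_samples=4):
        """Estimate log pi(a|s) ≈ logsumexp_k log pi(a|s, z_k) - log K."""
        prior = self._gaussian(self.prior_net(state))
        z = prior.rsample((num_samples,))            # (K, B, latent_dim)
        s = state.expand(num_samples, *state.shape)  # (K, B, state_dim)
        cond = self._gaussian(self.action_net(torch.cat([s, z], dim=-1)))
        log_p = cond.log_prob(action).sum(-1)        # (K, B); action broadcasts over K
        return torch.logsumexp(log_p, dim=0) - math.log(num_samples)
```

In a MaxEnt actor-critic loop, the negative of such a marginal log-density could stand in for the usual single-Gaussian entropy term; this particular estimator and its sample count are assumptions for illustration only, not a statement of how SMAC performs its marginalization.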
