Paper Title
On the role of planning in model-based deep reinforcement learning
Paper Authors
Paper Abstract
Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections to, and components shared with, many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.
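The abstract's second finding concerns shallow planning with simple Monte-Carlo rollouts. A minimal illustrative sketch of what such a planner can look like (a generic depth-limited random-rollout search over a hypothetical toy chain model, not MuZero's actual MCTS; `ChainModel`, `rollout_value`, and `shallow_plan` are all made up for this example):

```python
import random

class ChainModel:
    """Hypothetical toy dynamics model: states 0..n on a chain.

    Action 1 moves right (capped at n), action 0 stays put; reward 1.0
    is received whenever the agent occupies the rightmost state n.
    """
    def __init__(self, n=3):
        self.n = n

    def actions(self, state):
        return [0, 1]

    def step(self, state, action):
        next_state = min(self.n, state + action)
        reward = 1.0 if next_state == self.n else 0.0
        return next_state, reward

def rollout_value(model, state, action, depth, num_rollouts):
    """Estimate an action's value: take it once in the model, then follow
    a uniformly random policy for depth-1 steps, averaging the returns."""
    total = 0.0
    for _ in range(num_rollouts):
        s, reward = model.step(state, action)
        ret = reward
        for _ in range(depth - 1):
            a = random.choice(model.actions(s))
            s, reward = model.step(s, a)
            ret += reward
        total += ret
    return total / num_rollouts

def shallow_plan(model, state, depth=4, num_rollouts=200):
    """Depth-one 'tree': expand only the root actions and rank them
    by their Monte-Carlo rollout estimates."""
    return max(model.actions(state),
               key=lambda a: rollout_value(model, state, a,
                                           depth, num_rollouts))
```

The point of the sketch is the structural simplicity the result alludes to: no deep search tree, no learned value backup at internal nodes, just root-action expansion plus cheap random rollouts inside the model to score each candidate action.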