Paper Title

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Authors

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu

Abstract

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), which can effectively optimize a policy offline using 10-20 times fewer data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Code and pre-trained models are available at https://github.com/matsuolab/BREMEN.
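
The abstract only names the ingredients, so the following sketch shows how they fit together: deploy one fixed data-collection policy, fit an ensemble of dynamics models offline, re-initialize the policy by behavior cloning, and optimize in imagination while penalizing deviation from the behavior policy. This is a minimal illustration of ours, not the authors' released implementation: the toy one-dimensional environment, the known reward r = -|s|, the linear models, and the coordinate search standing in for trust-region policy updates are all assumptions made for brevity.

```python
# A toy, illustrative sketch of a BREMEN-style deployment loop; NOT the
# authors' released code. Assumed for illustration: a hypothetical 1-D
# environment with known reward r = -|s|, linear dynamics models, and a
# crude coordinate search in place of trust-region policy updates.
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    """Ground-truth toy dynamics, touched only during real deployments."""
    return s + 0.1 * a + 0.01 * rng.normal()

def collect(w_policy, n=200):
    """Deploy one fixed data-collection policy; gather an offline batch."""
    data, s = [], rng.normal()
    for _ in range(n):
        a = w_policy * s + 0.1 * rng.normal()     # exploration noise
        s_next = true_step(s, a)
        data.append((s, a, s_next))
        s = s_next if abs(s_next) < 5.0 else rng.normal()
    return data

def fit_ensemble(data, k=5):
    """Fit an ensemble of k linear dynamics models on bootstrap resamples."""
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(data), len(data))
        X = np.array([[data[i][0], data[i][1]] for i in idx])
        y = np.array([data[i][2] for i in idx])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        models.append(coef)                       # s' ~= coef[0]*s + coef[1]*a
    return models

def behavior_clone(data):
    """Recover the behavior policy a ~= w_bc * s by least squares."""
    s = np.array([d[0] for d in data])
    a = np.array([d[1] for d in data])
    return float(s @ a / (s @ s))

def imagined_return(models, w, horizon=30):
    """Average return of policy a = w*s under the learned ensemble only."""
    total = 0.0
    for coef in models:
        s = rng.normal()
        for _ in range(horizon):
            s = coef[0] * s + coef[1] * (w * s)   # imagined rollout step
            total += -abs(s)                      # known toy reward
    return total / len(models)

# The recursive loop the abstract describes: a handful of deployments,
# with all policy optimization happening offline against the ensemble.
w_policy = 0.0                                    # initial data-collection policy
for deployment in range(5):
    batch = collect(w_policy)                     # one real-world deployment
    models = fit_ensemble(batch)                  # ensemble dynamics models
    w_bc = behavior_clone(batch)                  # behavior-cloned re-init
    w = w_bc
    for _ in range(50):                           # offline optimization steps
        candidates = [w - 0.05, w, w + 0.05]
        # quadratic penalty keeps the new policy near the behavior policy,
        # a stand-in for BREMEN's KL trust region (implicit conservatism)
        scores = [imagined_return(models, c) - 5.0 * (c - w_bc) ** 2
                  for c in candidates]
        w = candidates[int(np.argmax(scores))]
    w_policy = w                                  # policy for next deployment
    print(f"deployment {deployment}: policy weight = {w_policy:.3f}")
```

The released code at https://github.com/matsuolab/BREMEN is the authoritative implementation of the method summarized above.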
