Title

Handling Sparse Rewards in Reinforcement Learning Using Model Predictive Control

Authors

Murad Dawood, Nils Dengler, Jorge de Heuvel, Maren Bennewitz

Abstract

Reinforcement learning (RL) has recently demonstrated great success in various domains. Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour. Using a sparse reward conveniently mitigates these challenges. However, the sparse reward poses a challenge on its own, often resulting in unsuccessful training of the agent. In this paper, we therefore address the sparse reward problem in RL. Our goal is to find an effective alternative to reward shaping, without using costly human demonstrations, that would also be applicable to a wide range of domains. Hence, we propose to use model predictive control (MPC) as an experience source for training RL agents in sparse reward environments. Without the need for reward shaping, we successfully apply our approach in the field of mobile robot navigation, both in simulation and in real-world experiments with a Kobuki TurtleBot 2. We furthermore demonstrate substantial improvement over pure RL algorithms in terms of success rate as well as the number of collisions and timeouts. Our experiments show that MPC as an experience source improves the agent's learning process for a given task in the case of sparse rewards.
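
To illustrate the core idea described in the abstract, using an MPC controller as the experience source that fills a replay buffer under a sparse reward, here is a minimal sketch. It is not the authors' implementation: the toy 2D point-mass navigation environment, the random-shooting MPC planner and its parameters, and the idea of later training an off-policy RL agent (e.g. SAC or TD3) from the buffer are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, NOT the authors' implementation: a random-shooting MPC planner
# used as an experience source that fills a replay buffer in a sparse-reward
# 2D point-mass navigation task. The toy environment, the planner parameters,
# and the idea of later training an off-policy RL agent (e.g. SAC or TD3) from
# the buffer are illustrative assumptions.
import numpy as np

GOAL = np.array([4.0, 4.0])   # goal position the robot must reach
DT = 0.1                      # integration step of the point-mass model
HORIZON = 15                  # MPC planning horizon (steps)
N_SAMPLES = 128               # sampled action sequences per planning step
GOAL_RADIUS = 0.3             # sparse reward is given only inside this radius


def step(state, action):
    """Point-mass dynamics with a sparse reward: +1 only when the goal is reached."""
    next_state = state + DT * np.clip(action, -1.0, 1.0)
    reached = np.linalg.norm(next_state - GOAL) < GOAL_RADIUS
    return next_state, (1.0 if reached else 0.0), reached


def mpc_action(state, rng):
    """Random-shooting MPC: sample action sequences, roll them out with the known
    model, and return the first action of the lowest-cost sequence. The MPC uses
    its own dense cost (distance to goal); the RL agent only ever sees the sparse
    reward stored in the replay buffer."""
    sequences = rng.uniform(-1.0, 1.0, size=(N_SAMPLES, HORIZON, 2))
    best_cost, best_first_action = np.inf, sequences[0, 0]
    for seq in sequences:
        s, cost = state.copy(), 0.0
        for a in seq:
            s, _, done = step(s, a)
            cost += np.linalg.norm(s - GOAL)
            if done:
                break
        if cost < best_cost:
            best_cost, best_first_action = cost, seq[0]
    return best_first_action


rng = np.random.default_rng(0)
replay_buffer = []  # (state, action, sparse_reward, next_state, done) tuples

for episode in range(3):
    state, done, t = np.zeros(2), False, 0
    for t in range(200):
        # MPC acts as the behaviour policy; an off-policy agent would be
        # updated from replay_buffer using only the sparse reward signal.
        action = mpc_action(state, rng)
        next_state, reward, done = step(state, action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    print(f"episode {episode}: steps={t + 1}, reached_goal={done}, "
          f"buffer_size={len(replay_buffer)}")
```

In this sketch the planner's internal cost provides the guidance that the sparse reward cannot, so the buffer ends up containing successful, goal-reaching transitions from which an off-policy learner could be trained without any reward shaping.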
