Paper Title
PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals
Paper Authors
Paper Abstract
Learning with sparse rewards remains a significant challenge in reinforcement learning (RL), especially when the aim is to train a policy capable of achieving multiple different goals. To date, the most successful approaches for dealing with multi-goal, sparse reward environments have been model-free RL algorithms. In this work we propose PlanGAN, a model-based algorithm specifically designed for solving multi-goal tasks in environments with sparse rewards. Our method builds on the fact that any trajectory of experience collected by an agent contains useful information about how to achieve the goals observed during that trajectory. We use this to train an ensemble of conditional generative models (GANs) to generate plausible trajectories that lead the agent from its current state towards a specified goal. We then combine these imagined trajectories into a novel planning algorithm in order to achieve the desired goal as efficiently as possible. The performance of PlanGAN has been tested on a number of robotic navigation/manipulation tasks in comparison with a range of model-free reinforcement learning baselines, including Hindsight Experience Replay. Our studies indicate that PlanGAN can achieve comparable performance whilst being around 4-8 times more sample efficient.
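The abstract describes two components: an ensemble of conditional generative models (GANs) that, given the current state and a goal, propose plausible steps towards that goal, and a planner that rolls out such imagined trajectories and acts on the most promising one. The sketch below illustrates these two ideas only at a schematic level; it is not the authors' implementation. The class and function names (ConditionalGenerator, plan_first_action), the network sizes, the state/goal/action dimensions, the random choice of ensemble member per rollout, and the assumption that the goal corresponds to the first coordinates of the state are all illustrative assumptions.

# Minimal sketch (assumptions noted above) of a conditional-generator ensemble
# and a simple rollout-based planner in the spirit of the abstract.
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM, ACTION_DIM, NOISE_DIM = 10, 3, 4, 8  # illustrative sizes

class ConditionalGenerator(nn.Module):
    """Generator of a conditional GAN: maps (state, goal, noise) to a
    plausible (action, next_state) pair leading towards the goal."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + GOAL_DIM + NOISE_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM + STATE_DIM),
        )

    def forward(self, state, goal, noise):
        out = self.net(torch.cat([state, goal, noise], dim=-1))
        return out[..., :ACTION_DIM], out[..., ACTION_DIM:]  # action, next state

def plan_first_action(generators, state, goal, horizon=10, n_rollouts=32):
    """Sample imagined trajectories from the ensemble and return the first
    action of the rollout whose final state lands closest to the goal."""
    best_action, best_dist = None, float("inf")
    with torch.no_grad():
        for _ in range(n_rollouts):
            # Pick one ensemble member at random for this imagined rollout.
            gen = generators[torch.randint(len(generators), (1,)).item()]
            s, first_action = state, None
            for t in range(horizon):
                a, s = gen(s, goal, torch.randn(NOISE_DIM))
                if t == 0:
                    first_action = a
            # Assumes the goal is expressed in the first GOAL_DIM state coords.
            dist = torch.norm(s[:GOAL_DIM] - goal)
            if dist < best_dist:
                best_dist, best_action = dist, first_action
    return best_action

# Usage: build an ensemble and plan one action towards a goal.
ensemble = [ConditionalGenerator() for _ in range(5)]
action = plan_first_action(ensemble, torch.zeros(STATE_DIM), torch.ones(GOAL_DIM))

In the paper the generators are trained adversarially on relabelled experience (any state reached along a trajectory can serve as an achieved goal), and the planning procedure is more involved than the greedy first-action selection shown here; the sketch only conveys the overall flow of conditioning on a goal, imagining trajectories, and acting on the best one.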