Paper Title

Robust Imitation of a Few Demonstrations with a Backwards Model

Paper Authors

Jung Yeon Park, Lawson L. S. Wong

Paper Abstract

Behavior cloning of expert demonstrations can speed up learning optimal policies in a more sample-efficient way than reinforcement learning. However, the policy cannot extrapolate well to unseen states outside of the demonstration data, creating covariate shift (agent drifting away from demonstrations) and compounding errors. In this work, we tackle this issue by extending the region of attraction around the demonstrations so that the agent can learn how to get back onto the demonstrated trajectories if it veers off-course. We train a generative backwards dynamics model and generate short imagined trajectories from states in the demonstrations. By imitating both demonstrations and these model rollouts, the agent learns the demonstrated paths and how to get back onto these paths. With optimal or near-optimal demonstrations, the learned policy will be both optimal and robust to deviations, with a wider region of attraction. On continuous control domains, we evaluate the robustness when starting from different initial states unseen in the demonstration data. While both our method and other imitation learning baselines can successfully solve the tasks for initial states in the training distribution, our method exhibits considerably more robustness to different initial states.
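The abstract describes a data-augmentation idea: roll a learned backwards dynamics model out from demonstration states to imagine short trajectories that lead back onto the demonstrations, then behavior-clone on both the demonstrations and these imagined trajectories. The sketch below illustrates that augmentation step only; it is not the authors' implementation, and the names `backward_model` (assumed to sample a predecessor state and the connecting action given a state), `demo_states`, `num_starts`, and `horizon` are hypothetical placeholders.

```python
# Minimal sketch of backward-rollout data augmentation for behavior cloning.
# Assumption: backward_model(s) returns (s_prev, a_prev), a sampled predecessor
# state and the action that moves the agent from s_prev to s.
import numpy as np

def generate_backward_rollouts(demo_states, backward_model, num_starts=32, horizon=5):
    """Imagine short trajectories that flow *into* demonstration states.

    demo_states: np.ndarray of shape (N, state_dim) taken from the demonstrations.
    Returns arrays of (state, action) pairs that, read forward in time,
    lead the agent back towards the demonstrated states.
    """
    extra_states, extra_actions = [], []
    start_idx = np.random.choice(len(demo_states), size=num_starts)
    for s in demo_states[start_idx]:
        traj = []  # collected backwards in time as (predecessor state, action) pairs
        for _ in range(horizon):
            s_prev, a_prev = backward_model(s)   # sample one step backwards
            traj.append((s_prev, a_prev))
            s = s_prev
        # Reverse so the pairs read forward in time: acting with a_prev from s_prev
        # moves the agent towards the demonstration state the rollout started from.
        for s_prev, a_prev in reversed(traj):
            extra_states.append(s_prev)
            extra_actions.append(a_prev)
    return np.array(extra_states), np.array(extra_actions)
```

Behavior cloning is then run on the union of the demonstration (state, action) pairs and these imagined pairs, so the learned policy covers both the demonstrated paths and corrective actions that return to them, widening the region of attraction around the demonstrations.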
