Paper Title


Self-Imitation Learning from Demonstrations

Authors

Georgiy Pshikhachev, Dmitry Ivanov, Vladimir Egorov, Aleksei Shpilman

Abstract


Despite the numerous breakthroughs achieved with Reinforcement Learning (RL), solving environments with sparse rewards remains a challenging task that requires sophisticated exploration. Learning from Demonstrations (LfD) remedies this issue by guiding the agent's exploration towards states experienced by an expert. Naturally, the benefits of this approach hinge on the quality of demonstrations, which are rarely optimal in realistic scenarios. Modern LfD algorithms require meticulous tuning of hyperparameters that control the influence of demonstrations and, as we show in the paper, struggle with learning from suboptimal demonstrations. To address these issues, we extend Self-Imitation Learning (SIL), a recent RL algorithm that exploits the agent's past good experience, to the LfD setup by initializing its replay buffer with demonstrations. We denote our algorithm as SIL from Demonstrations (SILfD). We empirically show that SILfD can learn from demonstrations that are noisy or far from optimal and can automatically adjust the influence of demonstrations throughout the training without additional hyperparameters or handcrafted schedules. We also find SILfD superior to the existing state-of-the-art LfD algorithms in sparse environments, especially when demonstrations are highly suboptimal.
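To make the idea described in the abstract concrete, below is a minimal, hypothetical Python sketch of how a Self-Imitation Learning replay buffer can be initialized with demonstrations. It is not the authors' implementation: names such as `SILReplayBuffer`, `sil_loss`, and `initialize_with_demonstrations` are illustrative, and `policy_log_prob` and `value_fn` are assumed to be user-provided callables.

```python
# Sketch of the SILfD idea: SIL whose replay buffer starts out filled with
# (possibly suboptimal) demonstrations. Illustrative only, not the paper's code.
import random
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return R_t for every step of an episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))


class SILReplayBuffer:
    """Replay buffer of (state, action, return) tuples used by Self-Imitation Learning."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []

    def add_episode(self, states, actions, rewards, gamma=0.99):
        for s, a, ret in zip(states, actions, discounted_returns(rewards, gamma)):
            if len(self.storage) >= self.capacity:
                self.storage.pop(0)  # drop the oldest transition
            self.storage.append((s, a, ret))

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


def sil_loss(batch, policy_log_prob, value_fn):
    """SIL objective: imitate an action only when its observed return beats the
    current value estimate, i.e. weight by the clipped advantage max(R - V(s), 0)."""
    losses = []
    for state, action, ret in batch:
        advantage = max(ret - value_fn(state), 0.0)  # worse-than-expected outcomes are ignored
        losses.append(-policy_log_prob(state, action) * advantage)
    return float(np.mean(losses))


def initialize_with_demonstrations(buffer, demonstrations, gamma=0.99):
    """SILfD: seed the SIL buffer with demonstration episodes before training.

    Each demo is assumed to be a dict with 'states', 'actions', and 'rewards'.
    During training the agent's own episodes are added to the same buffer; once a
    demonstration's return falls below the learned value estimate, its clipped
    advantage is zero, so its influence fades without extra hyperparameters.
    """
    for demo in demonstrations:
        buffer.add_episode(demo["states"], demo["actions"], demo["rewards"], gamma)
```

This sketch reflects the mechanism the abstract highlights: because the SIL loss only reinforces transitions whose returns exceed the current value estimate, the weight given to the demonstrations adjusts automatically as the agent improves.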
