Paper Title

Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL

Paper Authors

Akram Erraqabi, Marlos C. Machado, Mingde Zhao, Sainbayar Sukhbaatar, Alessandro Lazaric, Ludovic Denoyer, Yoshua Bengio

Paper Abstract

In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even though our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are not so effective.
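The abstract refers to framing Laplacian representation learning as the optimization of a temporally-contrastive objective. For context, below is a minimal PyTorch sketch of that kind of objective, following the graph-drawing formulation of Wu et al. (2019) on which this line of work builds. It is not this paper's method (the skill-based covering policy and the temporal-abstraction augmentation are omitted), and the names `laplacian_repr_loss`, `phi_*`, and `beta` are illustrative assumptions.

```python
import torch

def laplacian_repr_loss(phi_s, phi_s_next, phi_u, phi_v, beta=1.0):
    """Temporally-contrastive sketch of a Laplacian representation loss.

    phi_s, phi_s_next: embeddings of consecutive states along trajectories, shape (B, d).
    phi_u, phi_v: embeddings of state pairs sampled independently from the
    training state distribution, shape (B, d).
    """
    # Attractive term: pull embeddings of temporally adjacent states together,
    # approximating the Dirichlet energy sum over edges ||phi(s) - phi(s')||^2.
    attractive = (phi_s - phi_s_next).pow(2).sum(dim=1).mean()
    # Repulsive term: a soft orthonormality penalty (up to an additive constant)
    # on independently sampled pairs, preventing the embedding from collapsing.
    dots = (phi_u * phi_v).sum(dim=1)
    repulsive = (dots.pow(2)
                 - phi_u.pow(2).sum(dim=1)
                 - phi_v.pow(2).sum(dim=1)).mean()
    return attractive + beta * repulsive

# Toy usage with random embeddings standing in for a learned encoder.
B, d = 32, 8
loss = laplacian_repr_loss(torch.randn(B, d), torch.randn(B, d),
                           torch.randn(B, d), torch.randn(B, d))
```

The repulsive term is taken over pairs drawn from the training state distribution; the paper's contribution concerns precisely where that distribution comes from when uniform access to the state space is unavailable.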
