Paper Title

Data-Efficient Reinforcement Learning with Self-Predictive Representations

Paper Authors

Max Schwarzer, Ankesh Anand, Rishab Goel, R Devon Hjelm, Aaron Courville, Philip Bachman

Paper Abstract

While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state representations multiple steps into the future. We compute target representations for future states using an encoder which is an exponential moving average of the agent's parameters, and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent's representations to be consistent across multiple views of an observation. Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents a 55% relative improvement over the previous state-of-the-art. Notably, even in this limited data regime, SPR exceeds expert human scores on 7 out of 26 games. The code associated with this work is available at https://github.com/mila-iqia/spr
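To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of the self-predictive objective: an online encoder produces a latent, a learned transition model rolls it forward K steps conditioned on actions, and each predicted latent is matched (via cosine similarity) against a target representation from an EMA copy of the encoder. The toy `Encoder` and `TransitionModel` architectures, the `tau=0.99` EMA rate, and the omission of SPR's projection/prediction heads and Q-learning loss are simplifying assumptions of mine, not the authors' exact design; the real implementation lives at the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy stand-in for the convolutional encoder: observations -> latents."""
    def __init__(self, obs_dim=64, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class TransitionModel(nn.Module):
    """Rolls a latent state forward one step, conditioned on the action."""
    def __init__(self, latent_dim=32, n_actions=4):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, latent_dim)
        self.net = nn.Sequential(nn.Linear(2 * latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, z, a):
        return self.net(torch.cat([z, self.action_emb(a)], dim=-1))

@torch.no_grad()
def update_ema(online, target, tau=0.99):
    # Target parameters are an exponential moving average of the online ones.
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)

def spr_loss(encoder, target_encoder, transition, obs_seq, act_seq):
    """Negative cosine similarity between predicted latents and EMA-target
    latents, summed over the K future steps (projection heads omitted)."""
    z = encoder(obs_seq[0])                      # online latent for o_t
    loss = 0.0
    for k in range(act_seq.shape[0]):
        z = transition(z, act_seq[k])            # predict latent for o_{t+k+1}
        with torch.no_grad():                    # targets receive no gradient
            target = target_encoder(obs_seq[k + 1])
        loss = loss - F.cosine_similarity(z, target, dim=-1).mean()
    return loss

# Usage on dummy data: batch of 8 trajectories, 5-step prediction horizon.
# In full SPR, data augmentation (e.g. random shifts) would be applied to
# the observations before encoding, for both online and target branches.
enc, trans = Encoder(), TransitionModel()
tgt_enc = Encoder()
tgt_enc.load_state_dict(enc.state_dict())        # initialize target = online
obs = torch.randn(6, 8, 64)                      # observations o_t .. o_{t+5}
acts = torch.randint(0, 4, (5, 8))               # actions a_t .. a_{t+4}
loss = spr_loss(enc, tgt_enc, trans, obs, acts)
loss.backward()                                  # updates online nets only
update_ema(enc, tgt_enc)                         # then move the EMA target
```

In this sketch the prediction loss would be added to the agent's usual reward-maximization (Q-learning) loss rather than trained alone, which is how the abstract describes augmenting reward maximization with the self-supervised objective.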
