Paper Title
Prioritized Level Replay
Paper Authors
Paper Abstract
Environments with procedurally generated content serve as important benchmarks for testing systematic generalization in deep reinforcement learning. In this setting, each level is an algorithmically created environment instance with a unique configuration of its factors of variation. Training on a prespecified subset of levels allows for testing generalization to unseen levels. What can be learned from a level depends on the current policy, yet prior work defaults to uniform sampling of training levels independently of the policy. We introduce Prioritized Level Replay (PLR), a general framework for selectively sampling the next training level by prioritizing those with higher estimated learning potential when revisited in the future. We show TD-errors effectively estimate a level's future learning potential and, when used to guide the sampling procedure, induce an emergent curriculum of increasingly difficult levels. By adapting the sampling of training levels, PLR significantly improves sample efficiency and generalization on Procgen Benchmark--matching the previous state-of-the-art in test return--and readily combines with other methods. Combined with the previous leading method, PLR raises the state-of-the-art to over 76% improvement in test return relative to standard RL baselines.
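The abstract describes PLR's core mechanism: score each seen level by its estimated learning potential (derived from TD-errors) and sample future training levels in proportion to those scores. The sketch below illustrates one plausible realization of that idea, assuming a rank-based prioritization over per-level mean absolute TD-error mixed with a staleness term so long-unvisited levels are eventually revisited. The class and parameter names (`PrioritizedLevelSampler`, `temperature`, `staleness_coef`) are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

class PrioritizedLevelSampler:
    """Minimal sketch of prioritized level sampling (names are illustrative)."""

    def __init__(self, num_levels, temperature=0.1, staleness_coef=0.1):
        self.scores = np.zeros(num_levels)       # learning-potential score per level
        self.last_visit = np.zeros(num_levels)   # episode count at last visit
        self.episode_count = 0
        self.temperature = temperature
        self.staleness_coef = staleness_coef

    def update(self, level_id, td_errors):
        # Score a level by the mean magnitude of TD-errors from its latest rollout.
        self.scores[level_id] = np.abs(np.asarray(td_errors)).mean()
        self.episode_count += 1
        self.last_visit[level_id] = self.episode_count

    def sample(self):
        n = len(self.scores)

        # Rank-based prioritization: higher-scoring levels receive larger probabilities.
        ranks = np.empty(n)
        ranks[np.argsort(-self.scores)] = np.arange(1, n + 1)
        score_probs = (1.0 / ranks) ** (1.0 / self.temperature)
        score_probs /= score_probs.sum()

        # Staleness term: levels not visited recently regain sampling probability.
        staleness = self.episode_count - self.last_visit
        if staleness.sum() > 0:
            stale_probs = staleness / staleness.sum()
        else:
            stale_probs = np.full(n, 1.0 / n)

        probs = (1 - self.staleness_coef) * score_probs + self.staleness_coef * stale_probs
        return np.random.choice(n, p=probs)
```

In this sketch, `update` would be called after each rollout on a level with that rollout's TD-errors, and `sample` would pick the next training level; a full system would also decide when to sample an unseen level versus replaying a seen one.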