Paper Title
Improving Policy Optimization with Generalist-Specialist Learning
Paper Authors
Paper Abstract
Generalization in deep reinforcement learning over unseen environment variations usually requires policy learning over a large set of diverse training variations. We empirically observe that an agent trained on many variations (a generalist) tends to learn faster at the beginning, yet its performance plateaus at a suboptimal level for a long time. In contrast, an agent trained on only a few variations (a specialist) can often achieve high returns under a limited computational budget. To get the best of both worlds, we propose a novel generalist-specialist training framework. Specifically, we first train a generalist on all environment variations; when it fails to improve, we launch a large population of specialists with weights cloned from the generalist, each trained to master a selected small subset of variations. Finally, we resume training the generalist with auxiliary rewards induced by demonstrations from all specialists. In particular, we investigate when to start specialist training and compare strategies for learning generalists with assistance from specialists. We show that this framework pushes the envelope of policy learning on several challenging and popular benchmarks, including Procgen, Meta-World, and ManiSkill.
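Since the abstract walks through a concrete three-phase procedure (generalist pre-training, specialist fine-tuning, generalist resumption with demonstration-induced rewards), a short sketch may help fix the control flow. The Python sketch below is a minimal, hypothetical illustration: the class and helper names (`Policy`, `train_on`, `mean_return`, `plateaued`), the plateau heuristic, and the subset partitioning are assumptions of this write-up, and the RL updates and the auxiliary reward from specialist demonstrations are only stubbed, not the authors' actual implementation.

```python
# Hypothetical sketch of the generalist-specialist training loop.
# All helpers are placeholders standing in for a real RL pipeline.
import copy
import random

class Policy:
    """Stand-in for a policy network; clone() mimics weight copying."""
    def __init__(self):
        self.weights = [random.random() for _ in range(4)]

    def clone(self):
        return copy.deepcopy(self)

def train_on(policy, variations, steps, demos=None):
    """Placeholder RL update on the given environment variations.
    When `demos` is provided, it stands in for the auxiliary reward
    induced by specialist demonstrations (e.g., an imitation term
    added to the environment reward)."""
    for _ in range(steps):
        pass  # run PPO/SAC updates here in a real implementation
    return policy

def mean_return(policy, variations):
    """Placeholder evaluation: average episodic return over variations."""
    return sum(policy.weights)  # dummy score for illustration only

def plateaued(history, window=5, eps=1e-2):
    """Declare a plateau when recent evaluation returns stop improving."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < eps

all_variations = list(range(100))  # e.g., 100 Procgen levels
generalist = Policy()

# Phase 1: train the generalist on all variations until it plateaus.
history = []
while not plateaued(history):
    generalist = train_on(generalist, all_variations, steps=1000)
    history.append(mean_return(generalist, all_variations))

# Phase 2: clone a population of specialists from the generalist,
# each assigned a small subset of variations to master.
n_specialists = 10
subsets = [all_variations[i::n_specialists] for i in range(n_specialists)]
specialists = [train_on(generalist.clone(), s, steps=1000) for s in subsets]

# Phase 3: collect specialist demonstrations and resume generalist
# training with the demonstration-induced auxiliary reward.
demos = [(s, subset) for s, subset in zip(specialists, subsets)]  # rollouts in practice
generalist = train_on(generalist, all_variations, steps=1000, demos=demos)
```

One design point the abstract highlights is the timing of Phase 2: the specialists are launched only once the generalist stops improving, which is what the `plateaued` check stands in for here; in practice this trigger and the subset assignment are the knobs the paper investigates.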