Paper Title
Accelerated Policy Learning with Parallel Differentiable Simulation
Paper Authors
Paper Abstract
Deep reinforcement learning can generate complex control policies, but requires large amounts of training data to work effectively. Recent work has attempted to address this issue by leveraging differentiable simulators. However, inherent problems such as local minima and exploding/vanishing numerical gradients prevent these methods from being generally applied to control tasks with complex contact-rich dynamics, such as humanoid locomotion in classical RL benchmarks. In this work we present a high-performance differentiable simulator and a new policy learning algorithm (SHAC) that can effectively leverage simulation gradients, even in the presence of non-smoothness. Our learning algorithm alleviates problems with local minima through a smooth critic function, avoids vanishing/exploding gradients through a truncated learning window, and allows many physical environments to be run in parallel. We evaluate our method on classical RL control tasks, and show substantial improvements in sample efficiency and wall-clock time over state-of-the-art RL and differentiable simulation-based algorithms. In addition, we demonstrate the scalability of our method by applying it to the challenging high-dimensional problem of muscle-actuated locomotion with a large action space, achieving a greater than 17x reduction in training time over the best-performing established RL algorithm.
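To make the core idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of a SHAC-style short-horizon policy update: a policy is rolled through a differentiable simulator for a truncated window of steps across many parallel environments, the discounted window reward plus a terminal value from a smooth critic is accumulated, and the negated return is backpropagated through the simulation into the policy. The linear toy dynamics, reward, network sizes, and hyperparameters below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

num_envs, obs_dim, act_dim, horizon, gamma = 64, 8, 2, 16, 0.99

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Toy differentiable dynamics standing in for the differentiable simulator
# (assumed linear system; the paper's simulator handles contact-rich physics).
A = torch.randn(obs_dim, obs_dim) * 0.1
B = torch.randn(act_dim, obs_dim) * 0.1

def sim_step(state, action):
    """One differentiable simulation step with a simple quadratic reward."""
    next_state = state + state @ A + action @ B
    reward = -(next_state ** 2).sum(dim=-1)  # e.g. drive the state toward zero
    return next_state, reward

def short_horizon_policy_loss(state):
    """Discounted return over a truncated window plus a terminal critic value."""
    total = torch.zeros(num_envs)
    for t in range(horizon):
        action = policy(state)
        state, reward = sim_step(state, action)
        total = total + (gamma ** t) * reward
    total = total + (gamma ** horizon) * critic(state).squeeze(-1)
    return -total.mean()

state = torch.randn(num_envs, obs_dim)   # batch of parallel environments
loss = short_horizon_policy_loss(state)
policy_opt.zero_grad()
loss.backward()                          # gradients flow through sim_step
policy_opt.step()
# In the full algorithm the critic is subsequently fitted to value targets
# computed from the same short rollouts; that step is omitted here.
```

The truncated window keeps the backpropagation path short (avoiding exploding/vanishing gradients), while the learned critic supplies a smooth estimate of the return beyond the window, which is how the abstract describes mitigating local minima from non-smooth dynamics.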