Paper Title
Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction
Paper Authors
Paper Abstract
Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms. Despite its popularity, it suffers from a critical drawback: it relies on learning from a single sample of the joint action at a given state. As agents explore and update their policies during training, these single samples may poorly represent the actual joint policy of the system of agents, leading to high-variance gradient estimates that hinder learning. To address this problem, we propose an enhancement tool compatible with any actor-critic MARL method. Our framework, Performance Enhancing Reinforcement Learning Apparatus (PERLA), introduces a technique that samples the agents' joint policy inside the critics while the agents train. This leads to TD updates that closely approximate the true expected value under the current joint policy, rather than estimates from a single sample of the joint action at a given state. The result is low-variance, precise estimates of expected returns, minimising the critic estimator variance that typically hinders learning. Moreover, as we demonstrate, by eliminating much of the critic variance arising from single samples of the joint policy, PERLA enables CT-DE methods to scale more efficiently with the number of agents. Theoretically, we prove that PERLA reduces the variance of value estimates to a level comparable to that of decentralised training while retaining the benefits of centralised training. Empirically, we demonstrate PERLA's superior performance and ability to reduce estimator variance across a range of benchmarks including Multi-agent Mujoco and the StarCraft II Multi-agent Challenge.
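To make the core idea concrete, below is a minimal sketch of the variance-reduction step the abstract describes: rather than evaluating the centralised critic on the single joint action drawn from the buffer, the critic's value estimate is averaged over several joint actions sampled from the agents' current policies. This is not the paper's reference implementation; the names `q_net`, `policies`, and `num_samples` are illustrative assumptions, and per-agent policies are assumed to return torch distributions over discrete actions.

```python
import torch

def perla_style_value_estimate(q_net, policies, state, num_samples=10):
    """Monte-Carlo estimate of E_{a ~ joint policy}[Q(s, a)].

    Instead of using the critic value of one sampled joint action,
    average the critic over `num_samples` joint actions drawn from the
    agents' *current* (decentralised) policies. All names here are
    illustrative placeholders, not the paper's actual API.
    """
    values = []
    for _ in range(num_samples):
        # Sample one action per agent from its current policy,
        # forming one joint action for the centralised critic.
        joint_action = torch.stack(
            [policies[i](state).sample() for i in range(len(policies))],
            dim=-1,
        )
        values.append(q_net(state, joint_action))
    # Averaging over sampled joint actions gives a lower-variance
    # estimate of the expected return under the current joint policy.
    return torch.stack(values, dim=0).mean(dim=0)

# Hypothetical usage inside a TD update (reward, gamma, next_state assumed given):
# td_target = reward + gamma * perla_style_value_estimate(q_net, policies, next_state)
```

Under these assumptions, the averaged estimate replaces the single-sample critic value in the TD target, which is the mechanism the abstract credits for reducing critic estimator variance.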