Paper Title

Greedy based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning

Authors

Lipeng Wan, Zeyang Liu, Xingyu Chen, Xuguang Lan, Nanning Zheng

Abstract

Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they cannot ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible convergence. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.
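The abstract's starting point, that linear value decomposition cannot represent the optimal joint action in some games and therefore breaks optimal consistency, can be checked on a small one-step matrix game. Below is a minimal numpy sketch (an illustration under our own assumptions, not the paper's code or its exact example): it fits an additive LVD-style model Q_tot(a1, a2) = q1(a1) + q2(a2) to a payoff matrix commonly used to demonstrate relative overgeneralization, assuming a uniform data distribution, and shows that the greedy joint action of the fitted model is suboptimal.

```python
# Minimal sketch (not the paper's code): linear value decomposition (LVD),
# Q_tot = q1(a1) + q2(a2), fitted by least squares under a uniform data
# distribution, fails to recover the optimal joint action of this game.
import numpy as np

# True payoff matrix of a 2-agent, 3-action game; the optimum is (0, 0) with reward 8.
R = np.array([[  8., -12., -12.],
              [-12.,   0.,   0.],
              [-12.,   0.,   0.]])

# For a full, uniformly weighted action grid, the least-squares additive fit is
# row_mean(i) + col_mean(j) - grand_mean; split it into per-agent utilities.
grand = R.mean()
q1 = R.mean(axis=1) - grand / 2.0   # agent 1's utility per action (one common split)
q2 = R.mean(axis=0) - grand / 2.0   # agent 2's utility per action
Q_tot = q1[:, None] + q2[None, :]   # reconstructed joint Q under LVD

greedy = np.unravel_index(Q_tot.argmax(), Q_tot.shape)
print("greedy joint action under LVD:", greedy)   # (1, 1), not the optimal (0, 0)
print("true value of that action:", R[greedy])    # 0.0, while the optimum yields 8.0
```

The greedy joint action lands in the flat low-penalty region rather than on the optimal action, because the large penalties surrounding (0, 0) drag its additive estimate down; this is the relative-overgeneralization failure mode that GVR is designed to eliminate.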
