Paper Title

Greedy based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning

Authors

Lipeng Wan, Zeyang Liu, Xingyu Chen, Xuguang Lan, Nanning Zheng

Abstract

Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they cannot ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible convergence. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.
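The abstract's starting point, that linear value decomposition cannot represent the optimal joint action in some games and therefore breaks optimal consistency, can be checked on a small one-step matrix game. Below is a minimal numpy sketch (an illustration under our own assumptions, not the paper's code or its exact example): it fits an additive LVD-style model Q_tot(a1, a2) = q1(a1) + q2(a2) to a payoff matrix commonly used to demonstrate relative overgeneralization, assuming a uniform data distribution, and shows that the greedy joint action of the fitted model is suboptimal.

```python
# Minimal sketch (not the paper's code): linear value decomposition (LVD),
# Q_tot = q1(a1) + q2(a2), fitted by least squares under a uniform data
# distribution, fails to recover the optimal joint action of this game.
import numpy as np

# True payoff matrix of a 2-agent, 3-action game; the optimum is (0, 0) with reward 8.
R = np.array([[  8., -12., -12.],
              [-12.,   0.,   0.],
              [-12.,   0.,   0.]])

# For a full, uniformly weighted action grid, the least-squares additive fit is
# row_mean(i) + col_mean(j) - grand_mean; split it into per-agent utilities.
grand = R.mean()
q1 = R.mean(axis=1) - grand / 2.0   # agent 1's utility per action (one common split)
q2 = R.mean(axis=0) - grand / 2.0   # agent 2's utility per action
Q_tot = q1[:, None] + q2[None, :]   # reconstructed joint Q under LVD

greedy = np.unravel_index(Q_tot.argmax(), Q_tot.shape)
print("greedy joint action under LVD:", greedy)   # (1, 1), not the optimal (0, 0)
print("true value of that action:", R[greedy])    # 0.0, while the optimum yields 8.0
```

The greedy joint action lands in the flat low-penalty region rather than on the optimal action, because the large penalties surrounding (0, 0) drag its additive estimate down; this is the relative-overgeneralization failure mode that GVR is designed to eliminate.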
