Paper Title
On the Global Convergence Rates of Decentralized Softmax Gradient Play in Markov Potential Games
Paper Authors
Paper Abstract
Softmax policy gradient is a popular algorithm for policy optimization in single-agent reinforcement learning, particularly because no projection is needed after each gradient update. However, in multi-agent systems, the lack of central coordination introduces significant additional difficulties in the convergence analysis. Even for a stochastic game with identical interests, there can be multiple Nash equilibria (NEs), which disables proof techniques that rely on the existence of a unique global optimum. Moreover, the softmax parameterization introduces non-NE policies with zero gradient, making it difficult for gradient-based algorithms to find NEs. In this paper, we study the finite-time convergence of decentralized softmax gradient play in a special class of games, Markov Potential Games (MPGs), which includes identical-interest games as a special case. We investigate both gradient play and natural gradient play, with and without $\log$-barrier regularization. The established convergence rates for the unregularized cases contain a trajectory-dependent constant that can be arbitrarily large, whereas $\log$-barrier regularization overcomes this drawback, at the cost of a slightly worse dependence on other factors such as the action set size. An empirical study on an identical-interest matrix game confirms the theoretical findings.
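
The following is a minimal sketch, not the authors' implementation, of decentralized softmax gradient play on a two-player identical-interest matrix game (the setting of the paper's empirical study), with an optional log-barrier regularization term. The payoff matrix R, step size eta, regularization weight lam, and iteration count T are illustrative assumptions.

import numpy as np

# Minimal sketch (assumed setup): decentralized softmax gradient play on a
# 2-player identical-interest matrix game, with an optional log-barrier term.
rng = np.random.default_rng(0)
n_actions = 3
R = rng.random((n_actions, n_actions))   # shared payoff: both players receive R[a1, a2]
eta, lam, T = 0.1, 0.01, 5000            # step size, log-barrier weight, iterations

theta1 = np.zeros(n_actions)             # softmax logits of player 1
theta2 = np.zeros(n_actions)             # softmax logits of player 2

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for t in range(T):
    pi1, pi2 = softmax(theta1), softmax(theta2)

    q1 = R @ pi2                 # player 1's payoff for each action against pi2
    q2 = R.T @ pi1               # player 2's payoff for each action against pi1
    v = pi1 @ R @ pi2            # common value of the identical-interest game

    # Softmax policy gradient of each player's own objective (other player held fixed).
    grad1 = pi1 * (q1 - v)
    grad2 = pi2 * (q2 - v)

    # Optional log-barrier regularization (weight lam): keeps policies away from
    # the simplex boundary, where zero-gradient non-NE points can trap gradient play.
    grad1 += lam * (1.0 / n_actions - pi1)
    grad2 += lam * (1.0 / n_actions - pi2)

    # Decentralized simultaneous ascent steps; no central coordinator.
    theta1 += eta * grad1
    theta2 += eta * grad2

print("player 1 policy:", softmax(theta1))
print("player 2 policy:", softmax(theta2))

With lam = 0 this reduces to unregularized gradient play; setting lam > 0 corresponds to the log-barrier regularized variant discussed in the abstract.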