Paper Title


Smoothing Policy Iteration for Zero-sum Markov Games

Paper Authors

Ren, Yangang; Lyu, Yao; Wang, Wenxuan; Li, Shengbo Eben; Li, Zeyang; Duan, Jingliang

Paper Abstract


Zero-sum Markov Games (MGs) have been an efficient framework for multi-agent systems and robust control, wherein a minimax problem is constructed to solve for the equilibrium policies. At present, this formulation is well studied under tabular settings, wherein the maximum operator is typically solved exactly to compute the worst-case value function. However, it is non-trivial to extend such methods to handle complex tasks, as finding the maximum over a large-scale action space is usually cumbersome. In this paper, we propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs approximately, where the maximum operator is replaced by the weighted LogSumExp (WLSE) function to obtain nearly optimal equilibrium policies. In particular, the adversarial policy serves as the weight function to enable efficient sampling over the action space. We also prove the convergence of SPI and analyze its approximation error in the $\infty$-norm based on the contraction mapping theorem. Besides, we propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation. The target value related to the WLSE function is evaluated from sampled trajectories, a mean-square error is then constructed to optimize the value function, and gradient-ascent-descent methods are adopted to optimize the protagonist and adversarial policies jointly. In addition, we incorporate the reparameterization technique into model-based gradient back-propagation to prevent gradient vanishing caused by sampling from the stochastic policies. We verify our algorithm in both tabular and function approximation settings. Results show that SPI can approximate the worst-case value function with high accuracy, and that SaAC can stabilize the training process and improve adversarial robustness by a large margin.
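To illustrate the smoothing idea described in the abstract, the Python sketch below assumes the standard weighted LogSumExp form WLSE_beta(f; mu) = (1/beta) * log E_{a~mu}[exp(beta * f(a))], with the adversarial policy mu as the weight and beta as an inverse temperature; this form approaches max_{a in supp(mu)} f(a) as beta grows. The paper's exact operator definition and notation are not reproduced here, and the function names (wlse_exact, wlse_monte_carlo) and parameters are illustrative assumptions rather than the authors' implementation.

import numpy as np

def wlse_exact(f_values, mu, beta):
    """Exact weighted LogSumExp over a finite action space (assumed form).

    f_values : array of f(a) for each action a (e.g., adversary's Q-values)
    mu       : adversarial policy, a probability vector over the same actions
    beta     : inverse temperature; larger beta => closer to the max operator
    """
    z = beta * np.asarray(f_values)
    z_max = np.max(z)  # subtract the max for numerical stability
    return (z_max + np.log(np.dot(mu, np.exp(z - z_max)))) / beta

def wlse_monte_carlo(f, mu_sampler, beta, n_samples=128):
    """Sample-based WLSE estimate: actions are drawn from the adversarial
    policy, so the weighting is realized implicitly by the sampling
    distribution (mirroring the 'efficient sampling' idea in the abstract)."""
    actions = mu_sampler(n_samples)                  # a_i ~ mu
    z = beta * np.array([f(a) for a in actions])
    z_max = np.max(z)
    return (z_max + np.log(np.mean(np.exp(z - z_max)))) / beta

if __name__ == "__main__":
    # Toy check on a 5-action space: as beta grows, WLSE approaches
    # the maximum of f over the support of mu (here, 0.9).
    f_values = np.array([0.1, 0.5, 0.9, 0.3, 0.7])
    mu = np.array([0.1, 0.2, 0.4, 0.1, 0.2])
    for beta in (1.0, 10.0, 100.0):
        print(beta, wlse_exact(f_values, mu, beta))

In this sketch the smoothed maximum is differentiable in f, which is what allows a gradient-ascent-descent scheme of the kind the abstract describes to optimize both policies jointly instead of solving an exact inner maximization.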
