Paper Title
A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Games
Paper Authors
Paper Abstract
Existing studies on provably efficient algorithms for Markov games (MGs) almost exclusively build on the "optimism in the face of uncertainty" (OFU) principle. This work focuses on the alternative approach of posterior sampling, which is celebrated in many bandit and reinforcement learning settings but remains under-explored for MGs. Specifically, for episodic two-player zero-sum MGs, a novel posterior sampling algorithm with general function approximation is developed. Theoretical analysis demonstrates that the algorithm admits a $\sqrt{T}$-regret bound for problems with a low multi-agent decoupling coefficient, a new complexity measure for MGs, where $T$ denotes the number of episodes. When specialized to linear MGs, the obtained regret bound matches the state-of-the-art result. To the best of our knowledge, this is the first provably efficient posterior sampling algorithm for MGs with frequentist regret guarantees, which enriches the toolbox for MGs and promotes the broad applicability of posterior sampling.
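
For concreteness, one standard frequentist regret notion for the max-player in episodic two-player zero-sum MGs (the abstract does not spell out the exact definition used in the paper, so this is an assumption about the usual formulation in this literature) is the cumulative gap between the Nash value and the value of the deployed policy against a best-responding opponent:

$$
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \Big( V_{1}^{*}(s_{1}) \;-\; V_{1}^{\mu_{t},\dagger}(s_{1}) \Big),
$$

where $V_{1}^{*}$ is the Nash value of the game, $\mu_{t}$ is the max-player policy played in episode $t$, and $V_{1}^{\mu_{t},\dagger}$ is the value of $\mu_{t}$ against its best response. A $\sqrt{T}$ bound on this quantity means the average per-episode suboptimality vanishes at rate $1/\sqrt{T}$.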
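
To make the algorithmic template concrete, below is a minimal, hypothetical sketch of self-play posterior sampling, shown on the degenerate one-step (matrix game) case rather than the full episodic MG with general function approximation. All names here (`GamePosterior`, `solve_matrix_game`) and the Gaussian posterior are illustrative assumptions, not the paper's construction. Each episode, the learner samples a model from its posterior, computes an approximate Nash equilibrium of the sampled game, plays it in self-play, and updates the posterior with the observed payoff:

```python
# Hypothetical sketch of self-play posterior sampling on a one-step
# zero-sum matrix game; not the paper's algorithm or API.
import numpy as np

rng = np.random.default_rng(0)

class GamePosterior:
    """Toy posterior over a zero-sum matrix game with Gaussian payoff beliefs."""
    def __init__(self, n_max, n_min):
        self.mean = np.zeros((n_max, n_min))   # posterior mean payoff per entry
        self.count = np.ones((n_max, n_min))   # pseudo-counts (shrink uncertainty)

    def sample_model(self):
        # Thompson-style draw: sample a full payoff matrix from the posterior.
        return rng.normal(self.mean, 1.0 / np.sqrt(self.count))

    def update(self, a, b, payoff):
        # Running-mean update for the entry that was actually played.
        self.count[a, b] += 1.0
        self.mean[a, b] += (payoff - self.mean[a, b]) / self.count[a, b]

def solve_matrix_game(G, iters=2000):
    """Approximate Nash equilibrium of a zero-sum matrix game via fictitious play."""
    n, m = G.shape
    cnt_a, cnt_b = np.ones(n), np.ones(m)
    for _ in range(iters):
        cnt_a[np.argmax(G @ (cnt_b / cnt_b.sum()))] += 1   # max-player best response
        cnt_b[np.argmin((cnt_a / cnt_a.sum()) @ G)] += 1   # min-player best response
    return cnt_a / cnt_a.sum(), cnt_b / cnt_b.sum()

true_game = rng.normal(size=(3, 3))            # unknown environment
posterior = GamePosterior(3, 3)

for episode in range(500):                     # T episodes
    G_hat = posterior.sample_model()           # 1. sample a model from the posterior
    pi_max, pi_min = solve_matrix_game(G_hat)  # 2. equilibrium of the sampled game
    a = rng.choice(3, p=pi_max)                # 3. self-play: both sides follow it
    b = rng.choice(3, p=pi_min)
    payoff = true_game[a, b] + rng.normal(scale=0.1)
    posterior.update(a, b, payoff)             # 4. update the posterior
```

The design point this illustrates is the contrast with OFU: posterior sampling needs no explicit exploration bonus, since exploration comes entirely from the randomness of the sampled model.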