Paper Title
Influencing Long-Term Behavior in Multiagent Reinforcement Learning
Paper Authors
Paper Abstract
The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other agents and influence the evolution of future policies towards desirable behavior for its own benefit. Unfortunately, previous approaches for achieving this suffer from myopic evaluation, considering only a finite number of policy updates. As such, these methods can only influence transient future policies rather than achieving the promise of scalable equilibrium selection approaches that influence the behavior at convergence. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will converge to. Our paper characterizes desirable solution concepts within this problem setting and provides practical approaches for optimizing over possible outcomes. As a result of our farsighted objective, we demonstrate better long-term performance than state-of-the-art baselines across a suite of diverse multiagent benchmark domains.
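To make the stated objective concrete, here is a minimal sketch in illustrative notation that is not taken from the paper itself: let \(\pi^i\) denote agent \(i\)'s policy, \(r^i_t\) its reward at step \(t\), and \(\pi^{-i}_\infty(\pi^i)\) the limiting joint policy that the other agents' learning dynamics converge to when agent \(i\) commits to \(\pi^i\). The average-reward objective described in the abstract can then be read as

\[
\max_{\pi^i} \; \bar{J}^i(\pi^i) \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi^i,\, \pi^{-i}_\infty(\pi^i)} \left[ \sum_{t=0}^{T-1} r^i_t \right],
\]

where the dependence of \(\pi^{-i}_\infty\) on \(\pi^i\) is what allows the agent to shape the behavior other agents exhibit at convergence, rather than only their next few policy updates. The symbols \(\bar{J}^i\), \(\pi^{-i}_\infty\), and \(r^i_t\) are placeholders for this sketch, not the paper's own notation.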