Paper Title

Scalable Reinforcement Learning Policies for Multi-Agent Control

Authors

Christopher D. Hsu, Heejin Jeong, George J. Pappas, Pratik Chaudhari

Abstract

We develop a Multi-Agent Reinforcement Learning (MARL) method to learn scalable control policies for target tracking. Our method can handle an arbitrary number of pursuers and targets; we show results for tasks consisting of up to 1,000 pursuers tracking 1,000 targets. We use a decentralized, partially observable Markov Decision Process framework to model pursuers as agents receiving partial observations (range and bearing) about targets, which move using fixed, unknown policies. An attention mechanism is used to parameterize the value function of the agents; this mechanism allows us to handle an arbitrary number of targets. Entropy-regularized off-policy RL methods are used to train a stochastic policy, and we discuss how this enables a hedging behavior among pursuers that leads to a weak form of cooperation despite fully decentralized control execution. We further develop a masking heuristic that allows training on smaller problems with few pursuers and targets and execution on much larger problems. Thorough simulation experiments, ablation studies, and comparisons to state-of-the-art algorithms are performed to study the scalability of the approach and the robustness of performance to varying numbers of agents and targets.
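To make the two key ideas in the abstract concrete, the sketch below illustrates how an attention mechanism can aggregate a variable number of per-target observations (range and bearing) into a fixed-size embedding suitable for a pursuer's value function, and how a binary mask over target slots lets the same weights be used whether there are few or many targets. This is a minimal, hedged illustration with assumed names and shapes, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_embedding(obs, mask, W_q, W_k, W_v):
    """obs:  (n_targets, obs_dim) per-target observations, e.g. [range, bearing].
    mask: (n_targets,) with 1 for real targets, 0 for padded/ignored slots.
    Returns a fixed-size embedding whose dimension does not depend on n_targets."""
    q = obs @ W_q                                   # queries, (n_targets, d)
    k = obs @ W_k                                   # keys,    (n_targets, d)
    v = obs @ W_v                                   # values,  (n_targets, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (n_targets, n_targets)
    scores = np.where(mask[None, :] > 0, scores, -1e9)  # masked slots get ~zero weight
    attn = softmax(scores, axis=-1)
    pooled = (attn @ v) * mask[:, None]             # zero out padded query rows
    return pooled.sum(axis=0) / max(mask.sum(), 1.0)    # permutation-invariant pooling

obs_dim, d = 2, 16
W_q, W_k, W_v = (rng.normal(size=(obs_dim, d)) for _ in range(3))

# Train-time problem: 4 targets. Execution-time problem: 100 targets.
# The same weights produce a fixed-size embedding in both cases.
small = rng.normal(size=(4, obs_dim))
large = rng.normal(size=(100, obs_dim))
print(attention_embedding(small, np.ones(4), W_q, W_k, W_v).shape)    # (16,)
print(attention_embedding(large, np.ones(100), W_q, W_k, W_v).shape)  # (16,)
```

In this sketch, permutation-invariant pooling over attended target features is what gives independence from the number of targets, and the mask is a stand-in for the paper's masking heuristic of restricting attention to a subset of target slots; the actual network architecture, feature dimensions, and masking rule used in the paper may differ.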
