ALMA: Hierarchical Learning for Composite Multi-Agent Tasks

Paper Title

ALMA: Hierarchical Learning for Composite Multi-Agent Tasks

Authors

Shariq Iqbal, Robby Costales, Fei Sha

Abstract

Despite significant progress on multi-agent reinforcement learning (MARL) in recent years, coordination in complex domains remains a challenge. Work in MARL often focuses on solving tasks where agents interact with all other agents and entities in the environment; however, we observe that real-world tasks are often composed of several isolated instances of local agent interactions (subtasks), and each agent can meaningfully focus on one subtask to the exclusion of all else in the environment. In these composite tasks, successful policies can often be decomposed into two levels of decision-making: agents are allocated to specific subtasks and each agent acts productively towards their assigned subtask alone. This decomposed decision making provides a strong structural inductive bias, significantly reduces agent observation spaces, and encourages subtask-specific policies to be reused and composed during training, as opposed to treating each new composition of subtasks as unique. We introduce ALMA, a general learning method for taking advantage of these structured tasks. ALMA simultaneously learns a high-level subtask allocation policy and low-level agent policies. We demonstrate that ALMA learns sophisticated coordination behavior in a number of challenging environments, outperforming strong baselines. ALMA's modularity also enables it to better generalize to new environment configurations. Finally, we find that while ALMA can integrate separately trained allocation and action policies, the best performance is obtained only by training all components jointly. Our code is available at https://github.com/shariqiqbal2810/ALMA
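
The two-level decomposition described in the abstract (allocate agents to subtasks, then let each agent act on a view restricted to its own subtask) can be illustrated with a toy sketch. This is not the authors' implementation: `Entity`, `allocate`, `local_observation`, `act`, and `hierarchical_step` are hypothetical names, the 1-D world is invented for illustration, and both policies are hand-written stand-ins for what ALMA learns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    subtask: int      # index of the subtask this entity belongs to
    position: float   # location on a toy 1-D line


def allocate(agents, n_subtasks):
    """Hypothetical high-level policy: round-robin allocation.

    A learned allocation policy would take the place of this rule.
    """
    return {agent: i % n_subtasks for i, agent in enumerate(agents)}


def local_observation(entities, subtask):
    """Restrict the view to entities of the assigned subtask only."""
    return [e for e in entities if e.subtask == subtask]


def act(agent_pos, observation):
    """Hypothetical low-level policy: step toward the nearest visible entity."""
    if not observation:
        return 0
    target = min(observation, key=lambda e: abs(e.position - agent_pos))
    if target.position > agent_pos:
        return 1
    if target.position < agent_pos:
        return -1
    return 0


def hierarchical_step(agent_positions, entities, n_subtasks):
    """One decision step: allocate first, then act on the reduced observation."""
    allocation = allocate(sorted(agent_positions), n_subtasks)
    actions = {}
    for agent, subtask in allocation.items():
        obs = local_observation(entities, subtask)
        actions[agent] = act(agent_positions[agent], obs)
    return allocation, actions
```

Because each agent only sees the entities of its own subtask, the low-level rule is the same no matter how many other subtasks exist, which is the reuse property the abstract points to.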
