Paper Title

Multiagent Rollout and Policy Iteration for POMDP with Application to Multi-Robot Repair Problems

Authors

Sushmita Bhattacharya, Siva Kailas, Sahil Badyal, Stephanie Gil, Dimitri Bertsekas

Abstract

In this paper we consider infinite horizon discounted dynamic programming problems with finite state and control spaces, partial state observations, and a multiagent structure. We discuss and compare algorithms that simultaneously or sequentially optimize the agents' controls by using multistep lookahead, truncated rollout with a known base policy, and a terminal cost function approximation. Our methods specifically address the computational challenges of partially observable multiagent problems. In particular: 1) We consider rollout algorithms that dramatically reduce required computation while preserving the key cost improvement property of the standard rollout method. The per-step computational requirements for our methods are on the order of $O(Cm)$ as compared with $O(C^m)$ for standard rollout, where $C$ is the maximum cardinality of the constraint set for the control component of each agent, and $m$ is the number of agents. 2) We show that our methods can be applied to challenging problems with a graph structure, including a class of robot repair problems whereby multiple robots collaboratively inspect and repair a system under partial information. 3) We provide a simulation study that compares our methods with existing methods, and demonstrate that our methods can handle larger and more complex partially observable multiagent problems (state space size $10^{37}$ and control space size $10^{7}$, respectively). Finally, we incorporate our multiagent rollout algorithms as building blocks in an approximate policy iteration scheme, where successive rollout policies are approximated by using neural network classifiers. While this scheme requires a strictly off-line implementation, it works well in our computational experiments and produces additional significant performance improvement over the single online rollout iteration method.
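The abstract's central computational point, the reduction from $O(C^m)$ to $O(Cm)$ Q-factor evaluations per step, comes from optimizing the agents' controls one at a time rather than jointly: agent $i$ minimizes over its own $C$ choices while agents before it keep their already-selected controls and agents after it follow the base policy. The following is a minimal sketch of that agent-by-agent scheme on a toy problem, not the paper's implementation; the names `q_factor`, `base_policy`, and `CONTROLS` are illustrative stand-ins for a truncated-rollout cost estimate and a known base policy.

```python
import itertools

CONTROLS = [0, 1, 2]   # C = 3 control choices per agent (toy values)
M = 6                  # m = 6 agents
EVALS = {"standard": 0, "multiagent": 0}

def q_factor(state, joint_control, counter):
    """Stand-in for a truncated-rollout Q-factor estimate (in the paper this
    would simulate the base policy for several steps and add a terminal cost
    approximation); here it is an arbitrary cost with a coupling term."""
    EVALS[counter] += 1
    separable = sum((u - state) ** 2 for u in joint_control)
    coupling = abs(sum(joint_control) - state * len(joint_control))
    return separable + coupling

def base_policy(state, agent):
    """Trivial known base policy: every agent plays control 0."""
    return 0

def standard_rollout_step(state):
    # Standard rollout: one-step lookahead over all joint controls,
    # C**m Q-factor evaluations per step.
    return min(itertools.product(CONTROLS, repeat=M),
               key=lambda u: q_factor(state, u, "standard"))

def multiagent_rollout_step(state):
    # Multiagent rollout: agents optimize sequentially. Agent i minimizes
    # over its own C controls, with agents < i fixed to their chosen
    # controls and agents > i following the base policy: C*m Q-factor
    # evaluations per step.
    chosen = []
    for i in range(M):
        tail = [base_policy(state, j) for j in range(i + 1, M)]
        chosen.append(min(
            CONTROLS,
            key=lambda u: q_factor(state, tuple(chosen + [u] + tail),
                                   "multiagent")))
    return tuple(chosen)

state = 1
print("standard  :", standard_rollout_step(state), EVALS["standard"], "Q evals")
print("multiagent:", multiagent_rollout_step(state), EVALS["multiagent"], "Q evals")
```

On this toy instance the standard step performs $C^m = 3^6 = 729$ Q-factor evaluations while the agent-by-agent step performs $Cm = 3 \cdot 6 = 18$, matching the complexity comparison in the abstract; the paper's key result is that the sequential scheme nevertheless preserves the cost improvement property over the base policy that standard rollout provides.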
