Paper Title
Multi-armed Bandit Learning on a Graph
Paper Authors
Paper Abstract
The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotic applications, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully available, but the reward distributions are unknown. Building upon an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, G-UCB, that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds for non-episodic, undiscounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.
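To make the graph bandit setting concrete, below is a minimal, hypothetical Python sketch of optimism-based exploration on a graph with known structure and unknown node rewards. The toy graph `GRAPH`, the Bernoulli reward means `TRUE_MEANS`, and the one-step-lookahead move rule are all illustrative assumptions; the paper's G-UCB algorithm instead combines an offline graph planning subroutine with UCB-style confidence bounds, so this sketch only illustrates the constraint that the agent may move only to neighbors of its current node and chooses among them optimistically.

```python
# Hypothetical sketch (not the paper's G-UCB): an agent on a known graph may
# only move to a neighbor of its current node at each step. Each node has an
# unknown reward distribution; the agent moves to the neighbor with the
# highest UCB index and collects a sample there.

import math
import random

# Assumed toy instance: adjacency list over 5 nodes and Bernoulli reward means
# (the means are unknown to the learner and used only to simulate rewards).
GRAPH = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
TRUE_MEANS = [0.1, 0.3, 0.5, 0.2, 0.9]


def pull(node: int) -> float:
    """Sample a Bernoulli reward from the node's (unknown) distribution."""
    return float(random.random() < TRUE_MEANS[node])


def run(horizon: int = 5000, start: int = 0) -> float:
    counts = [0] * len(GRAPH)   # visits per node
    sums = [0.0] * len(GRAPH)   # cumulative reward per node
    node, total = start, 0.0
    for t in range(1, horizon + 1):
        def ucb(v: int) -> float:
            # Optimistic index: empirical mean plus confidence radius;
            # unvisited nodes are maximally optimistic.
            if counts[v] == 0:
                return float("inf")
            return sums[v] / counts[v] + math.sqrt(2.0 * math.log(t) / counts[v])

        # Move only within the neighborhood allowed by the graph.
        node = max(GRAPH[node], key=ucb)
        reward = pull(node)
        counts[node] += 1
        sums[node] += reward
        total += reward
    return total


if __name__ == "__main__":
    random.seed(0)
    print("collected reward over horizon:", run())
```

Because this sketch plans only one step ahead, it can get stuck commuting between locally attractive nodes; the longer-horizon planning used by G-UCB is precisely what lets the regret bound above scale with the graph diameter $D$ rather than degrade on poorly connected graphs.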