Paper Title

Continuous-in-time Limit for Bayesian Bandits

Authors

Yuhua Zhu, Zachary Izzo, Lexing Ying

Abstract

This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges toward a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.
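
To make the intractability claim concrete, below is a minimal sketch of exact Bayes-optimal dynamic programming for a two-armed Bernoulli bandit with uniform Beta(1, 1) priors. This is an illustrative assumption, not the paper's method or notation: the two-armed Bernoulli/Beta setup, the function value, and the horizon T are chosen only to show why exact computation scales poorly with the horizon, which is the cost growth the limiting HJB policy is meant to avoid.

# Sketch only (assumed setup, not the paper's algorithm): exact Bayes-optimal
# dynamic programming for a 2-armed Bernoulli bandit with Beta(1, 1) priors.
# The state is the posterior success/failure counts per arm, so the number of
# reachable states grows on the order of T^4 with the horizon T.
from functools import lru_cache

T = 20  # horizon; runtime and memory grow rapidly as T increases


@lru_cache(maxsize=None)
def value(t, s1, f1, s2, f2):
    """Optimal expected future reward with t pulls remaining, where arm i has
    posterior Beta(1 + s_i, 1 + f_i) under a uniform prior."""
    if t == 0:
        return 0.0
    # Posterior mean success probability of each arm.
    p1 = (1 + s1) / (2 + s1 + f1)
    p2 = (1 + s2) / (2 + s2 + f2)
    # Expected value of pulling each arm: immediate reward plus continuation
    # value, averaged over the Bernoulli outcome, with the posterior updated.
    q1 = (p1 * (1 + value(t - 1, s1 + 1, f1, s2, f2))
          + (1 - p1) * value(t - 1, s1, f1 + 1, s2, f2))
    q2 = (p2 * (1 + value(t - 1, s1, f1, s2 + 1, f2))
          + (1 - p2) * value(t - 1, s1, f1, s2, f2 + 1))
    return max(q1, q2)


if __name__ == "__main__":
    # Bayes-optimal expected total reward starting from the uniform prior.
    print(value(T, 0, 0, 0, 0))

Memoizing the value function over posterior counts is workable for small horizons, but the table (and runtime) blows up quickly as T grows, in contrast to the horizon-independent cost of the approximate Bayes-optimal policy described in the abstract.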
