Paper Title

Online Model Selection: a Rested Bandit Formulation

Authors

Leonardo Cella, Claudio Gentile, Massimiliano Pontil

Abstract

Motivated by a natural problem in online model selection with bandit information, we introduce and analyze a best arm identification problem in the rested bandit setting, wherein arm expected losses decrease with the number of times the arm has been played. The shape of the expected loss functions is similar across arms, and is assumed to be available up to unknown parameters that have to be learned on the fly. We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game. We analyze an arm elimination algorithm whose regret vanishes as the time horizon increases. The actual rate of convergence depends in a detailed way on the postulated functional form of the expected losses. Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function so as to identify the best arm as quickly as possible. We complement our analysis with a lower bound, indicating strengths and limitations of the proposed solution.
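To make the setting concrete, the following is a minimal, self-contained sketch, not the paper's actual algorithm. It assumes a hypothetical rested loss shape a_i + b_i / sqrt(n), where n is the number of times arm i has been played, and runs a naive round-robin elimination loop: each arm's (a_i, b_i) is fit by least squares on its observed losses, and arms whose extrapolated loss at the end of the horizon is clearly worse than the leader's are discarded. The decay shape, the parameter names, and the `confidence_width` threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: arm i's expected loss on its n-th play is
# a_i + b_i / sqrt(n), a "rested" shape that decreases with plays.
TRUE_A = np.array([0.30, 0.10, 0.25])  # asymptotic loss of each arm
TRUE_B = np.array([0.50, 0.90, 0.40])  # transient component of each arm


def pull(arm: int, n_plays: int) -> float:
    """Return a noisy loss observation for `arm` on its n-th play."""
    mean = TRUE_A[arm] + TRUE_B[arm] / np.sqrt(n_plays)
    return mean + 0.05 * rng.standard_normal()


def run(horizon: int = 2000, confidence_width: float = 0.05):
    k = len(TRUE_A)
    active = set(range(k))              # arms not yet eliminated
    plays = np.zeros(k, dtype=int)
    losses = [[] for _ in range(k)]     # observed losses per arm

    for t in range(horizon):
        # Round-robin over the surviving arms.
        arm = sorted(active)[t % len(active)]
        plays[arm] += 1
        losses[arm].append(pull(arm, plays[arm]))

        # Periodically fit (a, b) per arm by least squares on the model
        # a + b / sqrt(n), then compare extrapolated losses at the horizon.
        if t % 100 == 99 and len(active) > 1:
            projected = {}
            for i in active:
                n = np.arange(1, plays[i] + 1)
                X = np.column_stack([np.ones(plays[i]), 1.0 / np.sqrt(n)])
                a_hat, b_hat = np.linalg.lstsq(X, np.array(losses[i]), rcond=None)[0]
                projected[i] = a_hat + b_hat / np.sqrt(horizon)
            best = min(projected.values())
            # Keep only arms whose projected end-of-horizon loss is
            # within the (assumed) confidence width of the leader's.
            active = {i for i in active if projected[i] <= best + confidence_width}

    return sorted(active), plays


if __name__ == "__main__":
    survivors, plays = run()
    print("surviving arm(s):", survivors, "play counts:", plays)
```

In this toy instance, arm 1 has the largest loss on early plays but the smallest loss by the end of the horizon, which illustrates why the paper's regret benchmark compares against the arm with the smallest expected loss at the end of the game rather than the best arm at any fixed round.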
