Paper Title

Multi-Armed Bandits with Self-Information Rewards

Paper Authors

Nir Weinberger, Michal Yemini

Paper Abstract

This paper introduces the informational multi-armed bandit (IMAB) model, in which at each round a player chooses an arm, observes a symbol, and receives an unobserved reward in the form of the symbol's self-information. Thus, the expected reward of an arm is the Shannon entropy of the probability mass function of the source that generates its symbols. The player aims to maximize the expected total reward associated with the entropy values of the arms played. Under the assumption that the alphabet size is known, two UCB-based algorithms are proposed for the IMAB model which account for the bias of the plug-in entropy estimator. The first algorithm optimistically corrects the bias term in the entropy estimate. The second algorithm relies on data-dependent confidence intervals that adapt to sources with small entropy values. Performance guarantees are provided by upper bounding the expected regret of each algorithm. Furthermore, in the Bernoulli case, the asymptotic behavior of these algorithms is compared to the Lai-Robbins lower bound on the pseudo-regret. Additionally, under the assumption that the exact alphabet size is unknown and the player only knows a loose upper bound on it, a UCB-based algorithm is proposed in which the player aims to reduce the regret caused by the unknown alphabet size in a finite-time regime. Numerical results illustrating the expected regret of the algorithms presented in the paper are provided.
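To make the reward structure concrete: if arm i emits symbol x with probability p_i(x), the round's reward is the self-information -log p_i(x), whose expectation is the Shannon entropy H(p_i). The following minimal Python sketch illustrates the flavor of the first known-alphabet algorithm: a UCB index formed by the plug-in entropy estimate, an optimistic bias-correction term, and a confidence radius. The Miller-Madow correction (K-1)/(2n) and the log(K)*sqrt(log t / n) radius used here are assumed forms chosen for illustration; the paper's exact terms and constants differ.

```python
import numpy as np

def plug_in_entropy(counts):
    """Plug-in (maximum-likelihood) entropy estimate, in nats."""
    n = counts.sum()
    p = counts[counts > 0] / n
    return -np.sum(p * np.log(p))

def imab_ucb(sources, alphabet_size, horizon, seed=0):
    """UCB-style play for the IMAB model (illustrative sketch).

    Each arm's index is its plug-in entropy estimate plus a
    Miller-Madow-style bias correction (assumed form) and a generic
    confidence radius (assumed form).
    """
    rng = np.random.default_rng(seed)
    K = len(sources)
    counts = [np.zeros(alphabet_size) for _ in range(K)]
    pulls = np.zeros(K, dtype=int)
    for t in range(horizon):
        if t < K:
            arm = t  # initialization: play each arm once
        else:
            index = [
                plug_in_entropy(counts[i])
                + (alphabet_size - 1) / (2 * pulls[i])  # optimistic bias correction (assumed)
                + np.log(alphabet_size) * np.sqrt(np.log(t) / pulls[i])  # confidence radius (assumed)
                for i in range(K)
            ]
            arm = int(np.argmax(index))
        symbol = rng.choice(alphabet_size, p=sources[arm])  # observe a symbol from the chosen source
        counts[arm][symbol] += 1
        pulls[arm] += 1
    return pulls  # pull counts; low-entropy arms should be pulled rarely
```

In a quick check with two Bernoulli sources, say p = (0.5, 0.5) with entropy about 0.693 nats and p = (0.9, 0.1) with entropy about 0.325 nats, `imab_ucb(sources, alphabet_size=2, horizon=5000)` concentrates its pulls on the maximum-entropy arm once the low-entropy source's index shrinks.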
