论文标题
土匪的平均边界
Bandits with Mean Bounds
论文作者
论文摘要
我们研究匪徒问题的变体,其中提供了每个臂平均值的边界形式的侧面信息。我们证明,这些转化为对亚高斯因素的更严格的估计,并开发了利用这些估计值的新算法。在线性设置中,我们介绍了限制的OFUL(R-ful)算法,该算法还使用问题的几何特性来(可能)(可能)限制正在播放的武器集并降低次优臂的勘探率。在随机情况下,我们提出了采用推断的Subgaussian估计来适应武器勘探率的非自动全局探索(胶)算法。我们分析了Ruul和胶水的遗憾,表明我们的遗憾上限从未比标准的OFUL和UCB算法更糟糕。此外,我们还考虑了从均值界限自然出现的混杂日志中学习的实际动机设置。
We study a variant of the bandit problem where side information in the form of bounds on the mean of each arm is provided. We prove that these translate to tighter estimates of subgaussian factors and develop novel algorithms that exploit these estimates. In the linear setting, we present the Restricted-set OFUL (R-OFUL) algorithm that additionally uses the geometric properties of the problem to (potentially) restrict the set of arms being played and reduce exploration rates for suboptimal arms. In the stochastic case, we propose the non-optimistic Global Under-Explore (GLUE) algorithm which employs the inferred subgaussian estimates to adapt the rate of exploration for the arms. We analyze the regret of R-OFUL and GLUE, showing that our regret upper bounds are never worse than that of the standard OFUL and UCB algorithms respectively. Further, we also consider a practically motivated setting of learning from confounded logs where mean bounds appear naturally.