论文标题

具有方差依赖遗憾界限的对抗性强大的多臂强盗算法

Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds

论文作者

Ito, Shinji, Tsuchiya, Taira, Honda, Junya

论文摘要

本文考虑了多臂匪徒(MAB)问题,并提供了一种新的最佳世界(BOBW)算法,该算法在随机和对抗性设置中几乎可以最佳地工作。在随机设置中,某些现有的BOBW算法获得了$ O(\ sum_ {i:δ_i> 0} \ frac {\ frac {\ log t} {Δ_i} {Δ_i})$的紧密差异依赖的遗憾界限,用于次要差距gap $ gap $Δ_i$ i $ i $ i $ i $ i $ i $ y $ i $ $ i $ $ t $。如Audibert等。 [2007]但是,在具有低变化的臂的随机环境中,可以提高性能。实际上,他们提供了一种随机MAB算法,其差距依赖性遗憾界限为$ o(\ sum_ {i:Δ_i> 0}(\ frac {σ_i^2} {σ_i^2} {Δ_i} + log t)在本文中,我们提出了第一个具有差距依赖性界限的BOBW算法,表明即使在可能的对抗环境中,这些方差信息也可以使用。此外,我们的间隙变量依赖性结合中的领先常数因子仅(几乎)是下限的值的两倍。此外,所提出的算法在对抗性环境中享有多种与数据有关的遗憾界限,并且在具有对抗性腐败的随机设置中效果很好。所提出的算法基于以下规范化的领导方法,并采用自适应学习率,取决于损失的经验预测误差,这导致了反映了武器方差的差距变化依赖性的遗憾界限。

This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings. In stochastic settings, some existing BOBW algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: Δ_i>0} \frac{\log T}{Δ_i})$ for suboptimality gap $Δ_i$ of arm $i$ and time horizon $T$. As Audibert et al. [2007] have shown, however, that the performance can be improved in stochastic environments with low-variance arms. In fact, they have provided a stochastic MAB algorithm with gap-variance-dependent regret bounds of $O(\sum_{i: Δ_i>0} (\frac{σ_i^2}{Δ_i} + 1) \log T )$ for loss variance $σ_i^2$ of arm $i$. In this paper, we propose the first BOBW algorithm with gap-variance-dependent bounds, showing that the variance information can be used even in the possibly adversarial environment. Further, the leading constant factor in our gap-variance dependent bound is only (almost) twice the value for the lower bound. Additionally, the proposed algorithm enjoys multiple data-dependent regret bounds in adversarial settings and works well in stochastic settings with adversarial corruptions. The proposed algorithm is based on the follow-the-regularized-leader method and employs adaptive learning rates that depend on the empirical prediction error of the loss, which leads to gap-variance-dependent regret bounds reflecting the variance of the arms.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源