具有方差依赖遗憾界限的对抗性强大的多臂强盗算法

论文标题

具有方差依赖遗憾界限的对抗性强大的多臂强盗算法

Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds

论文作者

Ito, Shinji, Tsuchiya, Taira, Honda, Junya

论文摘要

本文考虑了多臂匪徒（MAB）问题，并提供了一种新的最佳世界（BOBW）算法，该算法在随机和对抗性设置中几乎可以最佳地工作。在随机设置中，某些现有的BOBW算法获得了$ O（\ sum_ {i：δ_i> 0} \ frac {\ frac {\ log t} {Δ_i} {Δ_i}）$的紧密差异依赖的遗憾界限，用于次要差距gap $ gap $Δ_i$ i $ i $ i $ i $ i $ i $ y $ i $ $ i $ $ t $。如Audibert等。 [2007]但是，在具有低变化的臂的随机环境中，可以提高性能。实际上，他们提供了一种随机MAB算法，其差距依赖性遗憾界限为$ o（\ sum_ {i：Δ_i> 0}（\ frac {σ_i^2} {σ_i^2} {Δ_i} + log t）在本文中，我们提出了第一个具有差距依赖性界限的BOBW算法，表明即使在可能的对抗环境中，这些方差信息也可以使用。此外，我们的间隙变量依赖性结合中的领先常数因子仅（几乎）是下限的值的两倍。此外，所提出的算法在对抗性环境中享有多种与数据有关的遗憾界限，并且在具有对抗性腐败的随机设置中效果很好。所提出的算法基于以下规范化的领导方法，并采用自适应学习率，取决于损失的经验预测误差，这导致了反映了武器方差的差距变化依赖性的遗憾界限。

This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings. In stochastic settings, some existing BOBW algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: Δ_i>0} \frac{\log T}{Δ_i})$ for suboptimality gap $Δ_i$ of arm $i$ and time horizon $T$. As Audibert et al. [2007] have shown, however, that the performance can be improved in stochastic environments with low-variance arms. In fact, they have provided a stochastic MAB algorithm with gap-variance-dependent regret bounds of $O(\sum_{i: Δ_i>0} (\frac{σ_i^2}{Δ_i} + 1) \log T )$ for loss variance $σ_i^2$ of arm $i$. In this paper, we propose the first BOBW algorithm with gap-variance-dependent bounds, showing that the variance information can be used even in the possibly adversarial environment. Further, the leading constant factor in our gap-variance dependent bound is only (almost) twice the value for the lower bound. Additionally, the proposed algorithm enjoys multiple data-dependent regret bounds in adversarial settings and works well in stochastic settings with adversarial corruptions. The proposed algorithm is based on the follow-the-regularized-leader method and employs adaptive learning rates that depend on the empirical prediction error of the loss, which leads to gap-variance-dependent regret bounds reflecting the variance of the arms.

下载PDF全文

下载文献需遵守相关版权规定

论文标题