Paper Title


Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards

Authors

Kyungjae Lee, Hongjun Yang, Sungbin Lim, Songhwai Oh

Abstract


In this paper, we consider stochastic multi-armed bandits (MABs) with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_{p}$ for $1<p\leq2$. First, we propose a novel robust estimator which does not require $\nu_{p}$ as prior information, while other existing robust estimators demand prior knowledge about $\nu_{p}$. We show that the error probability of the proposed estimator decays exponentially fast. Using this estimator, we propose a perturbation-based exploration strategy and develop a generalized regret analysis scheme that provides upper and lower regret bounds by revealing the relationship between the regret and the cumulative distribution function of the perturbation. From the proposed analysis scheme, we obtain gap-dependent and gap-independent upper and lower regret bounds for various perturbations. We also find the optimal hyperparameters for each perturbation, which achieve the minimax optimal regret bound with respect to the total number of rounds. In simulations, the proposed estimator shows favorable performance compared to existing robust estimators for various $p$ values and, on MAB problems, the proposed perturbation strategy outperforms existing exploration methods.
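To make the two ingredients of the abstract concrete, here is a minimal sketch of (a) a simple truncated-mean robust estimator for heavy-tailed samples and (b) a perturbation-based index policy that pulls the arm maximizing "robust estimate + random perturbation". Both the truncation level and the Gumbel perturbation scale are illustrative assumptions for this sketch, not the estimator or hyperparameters derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_mean(samples, p=1.5):
    """Illustrative robust estimator: zero out samples whose magnitude
    exceeds a threshold that grows with the sample count, then average.
    (A simplified stand-in, not the paper's exact estimator.)"""
    x = np.asarray(samples, dtype=float)
    b = len(x) ** (1.0 / p)  # truncation level (assumed growth rate)
    return float(np.mean(np.where(np.abs(x) <= b, x, 0.0)))

def perturbed_bandit(arms, T):
    """Perturbation-based exploration: each round, pull the arm whose
    robust mean estimate plus a random perturbation is largest."""
    history = [[] for _ in arms]
    total_reward = 0.0
    for _ in range(T):
        indices = []
        for k in range(len(arms)):
            if not history[k]:
                indices.append(np.inf)  # pull each arm at least once
            else:
                n = len(history[k])
                beta = rng.gumbel() / np.sqrt(n)  # assumed perturbation scale
                indices.append(truncated_mean(history[k]) + beta)
        a = int(np.argmax(indices))
        r = arms[a]()  # sample a (possibly heavy-tailed) reward
        history[a].append(r)
        total_reward += r
    return total_reward

# Two arms with shifted Student-t rewards: heavy tails with a finite
# p-th moment for p < degrees of freedom (here 1.8).
arms = [lambda: 0.5 + rng.standard_t(1.8), lambda: rng.standard_t(1.8)]
total = perturbed_bandit(arms, 2000)
```

The sketch only illustrates the interface between the estimator and the exploration rule; the paper's contribution is choosing the estimator and perturbation distribution so that the resulting regret is minimax optimal without knowing $\nu_{p}$ in advance.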
