Title
Non-Stationary Dueling Bandits
Authors
Abstract
We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments, each of which is associated with its own preference matrix. The learner repeatedly selects a pair of arms and observes a binary preference between them as feedback. To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible, despite preference matrices and segment lengths being unknown. We propose the $\mathrm{Beat\, the\, Winner\, Reset}$ algorithm and prove a bound on its expected binary weak regret in the stationary case, which tightens the bound of current state-of-the-art algorithms. We also show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$. We further propose and analyze two meta-algorithms, $\mathrm{DETECT}$ for weak regret and $\mathrm{Monitored\, Dueling\, Bandits}$ for strong regret, both based on a detection-window approach that can incorporate any dueling bandit algorithm as a black box. Finally, we prove a worst-case lower bound for expected weak regret in the non-stationary case.
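To make the problem setting concrete, the following is a minimal sketch (not the paper's algorithm) of the dueling-bandit feedback model described above: a per-segment preference matrix `P`, where `P[i][j]` is the probability that arm `i` beats arm `j`, a Condorcet-winner check, and a single binary duel. The example matrices `P1` and `P2` are illustrative assumptions standing in for two stationary segments whose Condorcet winner changes at a segment boundary.

```python
import random

def condorcet_winner(P):
    """Return the arm i with P[i][j] > 0.5 for every j != i, or None if no such arm exists."""
    K = len(P)
    for i in range(K):
        if all(P[i][j] > 0.5 for j in range(K) if j != i):
            return i
    return None

def duel(P, i, j, rng):
    """One round of feedback: return 1 if arm i beats arm j, else 0."""
    return 1 if rng.random() < P[i][j] else 0

# Hypothetical example with K = 3 arms: arm 0 is the Condorcet winner in the
# first stationary segment; after a change point, arm 2 becomes the winner.
P1 = [[0.50, 0.70, 0.60],
      [0.30, 0.50, 0.55],
      [0.40, 0.45, 0.50]]
P2 = [[0.50, 0.60, 0.30],
      [0.40, 0.50, 0.20],
      [0.70, 0.80, 0.50]]

rng = random.Random(0)
winner_1 = condorcet_winner(P1)  # arm 0
winner_2 = condorcet_winner(P2)  # arm 2
outcome = duel(P1, 0, 1, rng)    # binary preference feedback, 0 or 1
```

Under this model, the binary weak regret at a round is commonly taken to be 0 whenever at least one of the two chosen arms is the current segment's Condorcet winner, which is why the learner must track the winner across unknown change points.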