一种近乎最佳的最佳世界世界算法，用于在线学习，并使用反馈图

论文标题

一种近乎最佳的最佳世界世界算法，用于在线学习，并使用反馈图

A Near-Optimal Best-of-Both-Worlds Algorithm for Online Learning with Feedback Graphs

论文作者

Rouyer, Chloé, van der Hoeven, Dirk, Cesa-Bianchi, Nicolò, Seldin, Yevgeny

论文摘要

我们考虑使用反馈图，这是一个顺序的决策框架，其中学习者的反馈由动作集的有向图确定。我们提出了一种在此框架中学习的计算有效算法，该算法同时在随机和对抗环境中都达到了近乎最佳的后悔界限。针对遗忘的对手的界限是$ \ tilde {o}（\ sqrt {αt}）$，其中$ t $是时间范围，$α$是反馈图的独立数。 The bound against stochastic environments is $O\big( (\ln T)^2 \max_{S\in \mathcal I(G)} \sum_{i \in S} Δ_i^{-1}\big)$ where $\mathcal I(G)$ is the family of all independent sets in a suitably defined undirected version of the graph and $Δ_i$ are the次要差距。该算法结合了Exp3 ++算法的想法，用于随机和对抗性匪徒以及Exp3.g算法，用于反馈图的算法和新颖的探索方案。利用图形结构以减少探索的方案是通过反馈图获得最佳世界的关键。我们还将算法和结果扩展到允许反馈图随时间变化的设置。

We consider online learning with feedback graphs, a sequential decision-making framework where the learner's feedback is determined by a directed graph over the action set. We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments. The bound against oblivious adversaries is $\tilde{O} (\sqrt{αT})$, where $T$ is the time horizon and $α$ is the independence number of the feedback graph. The bound against stochastic environments is $O\big( (\ln T)^2 \max_{S\in \mathcal I(G)} \sum_{i \in S} Δ_i^{-1}\big)$ where $\mathcal I(G)$ is the family of all independent sets in a suitably defined undirected version of the graph and $Δ_i$ are the suboptimality gaps. The algorithm combines ideas from the EXP3++ algorithm for stochastic and adversarial bandits and the EXP3.G algorithm for feedback graphs with a novel exploration scheme. The scheme, which exploits the structure of the graph to reduce exploration, is key to obtain best-of-both-worlds guarantees with feedback graphs. We also extend our algorithm and results to a setting where the feedback graphs are allowed to change over time.

下载PDF全文

下载文献需遵守相关版权规定

论文标题