Paper Title

Deep Hierarchy in Bandits

Paper Authors

Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh

Paper Abstract

Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits. Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions. We confirm these theoretical findings empirically, in both synthetic and real-world experiments.
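
To make the algorithmic idea concrete, below is a minimal sketch of hierarchical Thompson sampling for a two-level Gaussian hierarchy, the simplest instance of the model described in the abstract: a latent variable mu ties together the mean rewards of all actions, the posterior over mu is computed exactly, and each round samples mu first and then the action means conditioned on it. The toy environment, variable names, and hyper-parameters are illustrative assumptions, not the authors' implementation; the paper's HierTS additionally handles hierarchies with more than two levels.

```python
# Sketch of two-level hierarchical Thompson sampling (illustrative,
# not the authors' code): mu ~ N(nu, q^2), theta_a | mu ~ N(mu, sigma0^2),
# reward r | theta_a ~ N(theta_a, sigma^2).
import numpy as np

rng = np.random.default_rng(0)

K = 10                  # number of actions (assumed for this toy example)
nu, q = 0.0, 1.0        # hyper-prior on the latent variable mu
sigma0 = 0.5            # spread of action means around mu
sigma = 1.0             # reward noise

# Sample a toy environment from the model itself.
mu_true = rng.normal(nu, q)
theta_true = rng.normal(mu_true, sigma0, size=K)

n = np.zeros(K)         # number of pulls per action
s = np.zeros(K)         # sum of observed rewards per action

for t in range(1000):
    # Exact marginal posterior of mu: given mu, the sample mean of each
    # pulled action a is N(mu, sigma0^2 + sigma^2 / n_a).
    pulled = n > 0
    var_a = sigma0**2 + sigma**2 / n[pulled]
    prec_mu = 1.0 / q**2 + np.sum(1.0 / var_a)
    mean_mu = (nu / q**2 + np.sum((s[pulled] / n[pulled]) / var_a)) / prec_mu

    # Hierarchical sampling: draw mu, then each theta_a | mu and the data.
    mu_sample = rng.normal(mean_mu, np.sqrt(1.0 / prec_mu))
    prec_theta = 1.0 / sigma0**2 + n / sigma**2
    mean_theta = (mu_sample / sigma0**2 + s / sigma**2) / prec_theta
    theta_sample = rng.normal(mean_theta, np.sqrt(1.0 / prec_theta))

    # Act greedily with respect to the sample, observe a reward, update.
    a = int(np.argmax(theta_sample))
    r = rng.normal(theta_true[a], sigma)
    n[a] += 1
    s[a] += r
```

Because the posterior over mu pools evidence from every action, pulling one action sharpens the sampled means of all the others; this shared statistical strength is what drives the regret reduction the abstract attributes to the hierarchy.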
