Paper Title
Latent Bandits Revisited
Paper Authors
Paper Abstract
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state. The primary goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning---complex models can be learned offline with the agent identifying latent state online---of practical relevance in, say, recommender systems. In this work, we propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling. Our methods are contextual and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions. A comprehensive empirical study showcases the advantages of our approach.
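To make the setting concrete, below is a minimal sketch of a latent bandit with a Thompson-sampling-style policy: the agent knows the per-state arm reward means (assumed Gaussian here), maintains a posterior over the hidden latent state, samples a state each round, and plays the arm that is optimal under the sampled state. The reward matrix, noise model, and update rule are illustrative assumptions, not the paper's exact mUCB/mTS algorithms, which additionally handle context, model uncertainty, and misspecification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known offline model: mean reward of each arm conditioned on each latent state.
# Rows = latent states, columns = arms (illustrative values; Gaussian rewards assumed).
mu = np.array([
    [0.9, 0.1, 0.2],   # latent state 0
    [0.1, 0.8, 0.3],   # latent state 1
])
n_states, n_arms = mu.shape
sigma = 1.0            # assumed known reward noise

true_state = 1                               # hidden from the agent
belief = np.full(n_states, 1.0 / n_states)   # posterior over latent states

total_reward = 0.0
horizon = 500
for t in range(horizon):
    # Thompson-sampling-style step: sample a latent state from the posterior,
    # then play the arm that is optimal under the sampled state.
    s = rng.choice(n_states, p=belief)
    arm = int(np.argmax(mu[s]))

    # Environment draws a reward from the true (unknown) latent state.
    reward = rng.normal(mu[true_state, arm], sigma)
    total_reward += reward

    # Bayes update of the belief using the known conditional reward model.
    likelihood = np.exp(-0.5 * ((reward - mu[:, arm]) / sigma) ** 2)
    belief = belief * likelihood
    belief /= belief.sum()

print("posterior over latent states:", np.round(belief, 3))
print("average reward:", total_reward / horizon)
```

Because the per-state reward model is known in advance, the agent only has to resolve which of the few latent states it is in, which is why regret can scale with the number of latent states rather than the (typically larger) number of actions.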