Paper Title
Local Policy Improvement for Recommender Systems
Paper Authors
Paper Abstract
Recommender systems predict what items a user will interact with next, based on their past interactions. The problem is often approached through supervised learning, but recent advancements have shifted towards policy optimization of rewards (e.g., user engagement). One challenge with the latter is policy mismatch: we are only able to train a new policy given data collected from a previously-deployed policy. The conventional way to address this problem is through importance sampling correction, but this comes with practical limitations. We suggest an alternative approach of local policy improvement without off-policy correction. Our method computes and optimizes a lower bound of expected reward of the target policy, which is easy to estimate from data and does not involve density ratios (such as those appearing in importance sampling correction). This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently. We provide empirical evidence and practical recipes for applying our technique in a sequential recommendation setting.
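To illustrate why such a lower bound can avoid density ratios, here is a minimal sketch based on Jensen's inequality applied to the importance-sampling identity, assuming non-negative rewards; the paper's exact bound and notation (e.g., the symbols J, pi, beta, r below) may differ:

\[
\log J(\pi) \;=\; \log \mathbb{E}_{x,\,a\sim\beta}\!\left[\frac{\pi(a\mid x)}{\beta(a\mid x)}\, r(x,a)\right]
\;\ge\; \mathrm{const} \;+\; \frac{\mathbb{E}_{x,\,a\sim\beta}\!\left[r(x,a)\,\log \pi(a\mid x)\right]}{\mathbb{E}_{x,\,a\sim\beta}\!\left[r(x,a)\right]},
\]

where \(\beta\) is the previously deployed (logging) policy and the constant collects all terms that do not depend on the target policy \(\pi\), including the \(\log\beta\) contribution. Under this reading, maximizing the bound over \(\pi\) amounts to a reward-weighted log-likelihood objective on logged data, \(\sum_i r_i \log \pi(a_i \mid x_i)\), so no ratio \(\pi/\beta\) needs to be estimated during training.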