Paper Title

Contextual Bandits for adapting to changing User preferences over time

Paper Authors

Rao, Dattaraj

Abstract

Contextual bandits provide an effective way to model dynamic data problems in ML by leveraging online (incremental) learning to continuously adjust predictions based on a changing environment. We explore the details of contextual bandits, an extension of the traditional reinforcement learning (RL) problem, and build a novel algorithm to solve this problem using an array of action-based learners. We apply this approach to model an article recommendation system, using an array of stochastic gradient descent (SGD) learners to predict rewards based on the actions taken. We then extend the approach to the publicly available MovieLens dataset and explore the findings. First, we make available a simplified simulated dataset showing varying user preferences over time and how this can be evaluated with static and dynamic learning algorithms. This dataset, made available as part of this research, is intentionally simulated with a limited number of features and can be used to evaluate different problem-solving strategies. We build a classifier using the static dataset and evaluate its performance on this dataset. We show the limitations of a static learner due to its fixed context at a point in time and how changing that context brings down the accuracy. Next, we develop a novel algorithm for solving the contextual bandit problem. Similar to linear bandits, this algorithm maps the reward as a function of the context vector but uses an array of learners to capture variation between actions/arms. We develop a bandit algorithm using an array of stochastic gradient descent (SGD) learners, with a separate learner per arm. Finally, we apply this contextual bandit algorithm to predicting movie ratings over time by different users from the standard MovieLens dataset and demonstrate the results.
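
The per-arm learner design described in the abstract can be sketched in a few lines of Python. The following is a minimal illustration, not the paper's actual implementation: it assumes scikit-learn's SGDRegressor as the incremental learner and a simple epsilon-greedy exploration policy, and the class name, hyperparameters, and simulated reward signal are illustrative assumptions.

```python
# Minimal sketch of a contextual bandit with one incremental SGD learner per arm.
# All names and parameters here are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.linear_model import SGDRegressor

class ArraySGDBandit:
    """Contextual bandit that keeps a separate online SGD regressor per arm."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        # Each regressor maps the context vector to an expected reward for its arm.
        self.learners = [SGDRegressor(learning_rate="constant", eta0=0.01)
                         for _ in range(n_arms)]
        self.seen = [False] * n_arms  # partial_fit must run before predict

    def select_arm(self, context):
        context = context.reshape(1, -1)
        # Explore with probability epsilon, or while any arm is still untrained.
        if np.random.rand() < self.epsilon or not all(self.seen):
            return np.random.randint(len(self.learners))
        # Exploit: choose the arm whose learner predicts the highest reward.
        estimates = [learner.predict(context)[0] for learner in self.learners]
        return int(np.argmax(estimates))

    def update(self, arm, context, reward):
        # Incrementally update only the chosen arm's learner (online learning).
        self.learners[arm].partial_fit(context.reshape(1, -1), [reward])
        self.seen[arm] = True


# Toy usage: 3 arms, 5-dimensional context, simulated reward signal.
bandit = ArraySGDBandit(n_arms=3)
for t in range(1000):
    context = np.random.rand(5)
    arm = bandit.select_arm(context)
    reward = float(np.dot(context, np.arange(5) == arm))  # reward tied to chosen arm
    bandit.update(arm, context, reward)
```

Because each arm's learner is updated with partial_fit on every interaction, the model keeps adapting as user preferences drift, which is the behavior a single static classifier trained on a fixed snapshot cannot provide.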
