Paper Title
Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information
Paper Authors
Paper Abstract
Motivated by human-machine interactions such as training chatbots to improve customer satisfaction, we study human-guided human-machine interaction involving private information. We model this interaction as a two-player turn-based game, where one player (Alice, a human) guides the other player (Bob, a machine) towards a common goal. Specifically, we focus on offline reinforcement learning (RL) in this game, where the goal is to find a policy pair for Alice and Bob that maximizes their expected total rewards based on an offline dataset collected a priori. The offline setting presents two challenges: (i) we cannot collect Bob's private information, which leads to a confounding bias when standard RL methods are used, and (ii) there is a distributional mismatch between the behavior policy used to collect the data and the desired policy we aim to learn. To tackle the confounding bias, we treat Bob's previous action as an instrumental variable for Alice's current decision making so as to adjust for the unmeasured confounding. We develop a novel identification result and use it to propose a new off-policy evaluation (OPE) method for evaluating policy pairs in this two-player turn-based game. To tackle the distributional mismatch, we leverage the idea of pessimism and use our OPE method to develop an off-policy learning algorithm that finds a desirable policy pair for both Alice and Bob. Finally, we prove that, under mild assumptions such as partial coverage of the offline data, the policy pair obtained through our method converges to the optimal one at a satisfactory rate.
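To make the setup concrete, below is a minimal, illustrative Python sketch (not the paper's algorithm) of the two-player turn-based game described in the abstract, together with a generic pessimism-style policy selection rule. All dynamics, rewards, and function names here are hypothetical: Bob's private information u drives his actions but is never logged, which is the source of the confounding bias, and the Monte-Carlo value estimate below is merely a stand-in for the paper's instrumental-variable-based OPE estimator.

import numpy as np

rng = np.random.default_rng(0)

def rollout(alice_policy, bob_policy, horizon=10):
    """Simulate one episode of a toy turn-based game between Alice and Bob.

    Bob's private information `u` influences his actions but is never
    recorded in the logged trajectory, mimicking the confounding problem
    described in the abstract.
    """
    state = rng.normal()           # shared, observable state
    u = rng.normal()               # Bob's private information (unobserved)
    prev_bob_action = 0.0
    total_reward, trajectory = 0.0, []
    for _ in range(horizon):
        # Alice moves first, guiding Bob; she sees the shared state and
        # Bob's previous action (the instrumental variable in the paper).
        a_alice = alice_policy(state, prev_bob_action)
        # Bob responds using the shared state, Alice's guidance, and his
        # private information.
        a_bob = bob_policy(state, a_alice, u)
        reward = -abs(state + 0.5 * a_alice + 0.5 * a_bob)  # toy common goal
        total_reward += reward
        trajectory.append((state, a_alice, a_bob, reward))  # u is NOT logged
        state = 0.9 * state + 0.1 * a_bob + 0.1 * rng.normal()
        prev_bob_action = a_bob
    return total_reward, trajectory

def pessimistic_select(candidates, n_eval=200, penalty_scale=1.0):
    """Pick the policy pair maximizing a lower confidence bound on value.

    The Monte-Carlo estimate here is only a placeholder for an offline
    OPE estimator; the penalty plays the role of the pessimism term.
    """
    best_pair, best_score = None, -np.inf
    for name, (alice_pi, bob_pi) in candidates.items():
        returns = [rollout(alice_pi, bob_pi)[0] for _ in range(n_eval)]
        value_hat = np.mean(returns)
        penalty = penalty_scale * np.std(returns) / np.sqrt(n_eval)
        score = value_hat - penalty  # value estimate minus uncertainty
        if score > best_score:
            best_pair, best_score = name, score
    return best_pair, best_score

candidates = {
    "cautious": (lambda s, b: -0.5 * s,
                 lambda s, a, u: -0.5 * s + 0.1 * u),
    "aggressive": (lambda s, b: -1.0 * s - 0.2 * b,
                   lambda s, a, u: -1.0 * s + 0.1 * u),
}
print(pessimistic_select(candidates))

The lower-confidence-bound score (estimated value minus an uncertainty penalty) illustrates the pessimism principle at a high level: a candidate policy pair is preferred only if its value remains high after accounting for estimation uncertainty, which is what guards against the distributional mismatch inherent in offline data.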