Paper Title
FRESH: Interactive Reward Shaping in High-Dimensional State Spaces using Human Feedback
Paper Authors
Paper Abstract
Reinforcement learning has been successful in training autonomous agents to accomplish goals in complex environments. Although this has been adapted to multiple settings, including robotics and computer games, human players often find it easier to obtain higher rewards in some environments than reinforcement learning algorithms do. This is especially true of high-dimensional state spaces where the reward obtained by the agent is sparse or extremely delayed. In this paper, we seek to effectively integrate feedback signals supplied by a human operator with deep reinforcement learning algorithms in high-dimensional state spaces. We call this approach FRESH (Feedback-based REward SHaping). During training, a human operator is presented with trajectories from a replay buffer and provides feedback on states and actions in each trajectory. In order to generalize the feedback provided by the human operator to previously unseen states and actions at test time, we use a feedback neural network. We use an ensemble of neural networks with a shared architecture to represent model uncertainty and the confidence of the network in its output. The output of the feedback neural network is converted to a shaping reward that is added to the reward provided by the environment. We evaluate our approach on the Bowling and Skiing Atari games in the Arcade Learning Environment. Although human experts have been able to achieve high scores in these environments, state-of-the-art deep learning algorithms perform poorly on them. We observe that FRESH achieves much higher scores than state-of-the-art deep learning algorithms in both environments. FRESH also achieves a 21.4% higher score than a human expert in Bowling and performs as well as a human expert in Skiing.
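To make the shaping mechanism described in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how an ensemble of feedback networks with a shared architecture could produce a confidence-gated shaping reward that is added to the environment reward. The FeedbackNet layer sizes, the agreement threshold, and the shaping scale are illustrative assumptions rather than values from the paper.

# Minimal sketch (not the authors' implementation): turning the output of an
# ensemble feedback network into a shaping reward added to the environment
# reward. Network sizes, the agreement threshold, and the shaping scale are
# illustrative assumptions.
import torch
import torch.nn as nn

class FeedbackNet(nn.Module):
    """Small CNN over an Atari-style frame stack; outputs per-action 'good feedback' probabilities."""
    def __init__(self, in_channels: int = 4, n_actions: int = 6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32 * 9 * 9, 128), nn.ReLU(),
                                  nn.Linear(128, n_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(self.conv(state)))

def shaped_reward(ensemble, state, action, env_reward,
                  scale: float = 1.0, agreement_threshold: float = 0.8):
    """Add a shaping term to the environment reward only when the ensemble is confident."""
    with torch.no_grad():
        probs = torch.stack([net(state)[0, action] for net in ensemble])  # (K,)
    mean = probs.mean()
    # Fraction of ensemble members agreeing with the majority vote, used here
    # as a simple confidence proxy (an assumption of this sketch).
    votes = (probs > 0.5).float()
    agreement = torch.maximum(votes.mean(), 1.0 - votes.mean())
    if agreement < agreement_threshold:
        return env_reward  # low confidence: leave the reward unshaped
    shaping = scale * (2.0 * mean.item() - 1.0)  # map [0, 1] -> [-1, 1]
    return env_reward + shaping

# Usage with randomly initialized weights and a dummy 84x84 frame stack:
ensemble = [FeedbackNet() for _ in range(5)]
state = torch.zeros(1, 4, 84, 84)
print(shaped_reward(ensemble, state, action=2, env_reward=0.0))

The confidence gate reflects the abstract's use of the ensemble to represent model uncertainty: when members disagree, the sketch falls back to the unshaped environment reward instead of injecting an unreliable shaping signal.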