Paper Title

Evaluating Agents without Rewards

Paper Authors

Brendon Matusch, Jimmy Ba, Danijar Hafner

Paper Abstract

Reinforcement learning has enabled agents to solve challenging tasks in unknown environments. However, manually crafting reward functions can be time-consuming, expensive, and prone to human error. Competing objectives have been proposed for agents to learn without external supervision, but it has been unclear how well they reflect task rewards or human behavior. To accelerate the development of intrinsic objectives, we retrospectively compute potential objectives on pre-collected datasets of agent behavior, rather than optimizing them online, and compare them by analyzing their correlations. We study input entropy, information gain, and empowerment across seven agents, three Atari games, and the 3D game Minecraft. We find that all three intrinsic objectives correlate more strongly with a human behavior similarity metric than with task reward. Moreover, input entropy and information gain correlate more strongly with human similarity than task reward does, suggesting the use of intrinsic objectives for designing agents that behave similarly to human players.
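The core of the method is a retrospective comparison: score each pre-collected agent dataset under several intrinsic objectives, then correlate those scores with task reward and with a human behavior similarity metric. Below is a minimal sketch of that comparison step, assuming per-agent scores have already been computed. All array names and values are illustrative placeholders (one entry per agent, matching the seven agents studied), not the paper's data, and the use of Spearman rank correlation is an assumption, since the abstract does not name the correlation measure.

```python
import numpy as np
from scipy.stats import spearmanr  # rank correlation; robust to differing scales

# Hypothetical per-agent scores for one environment (seven agents).
# In the paper's setting these would be computed retrospectively
# from pre-collected datasets of agent behavior.
input_entropy    = np.array([2.1, 3.4, 1.8, 4.0, 3.0, 2.7, 3.9])
information_gain = np.array([0.5, 1.2, 0.4, 1.5, 1.0, 0.9, 1.4])
empowerment      = np.array([1.1, 1.9, 0.8, 2.2, 1.7, 1.3, 2.0])
task_reward      = np.array([10., 35., 5., 40., 28., 22., 38.])
human_similarity = np.array([0.2, 0.6, 0.1, 0.8, 0.5, 0.4, 0.7])

# Compare each intrinsic objective against task reward and human similarity.
for name, objective in [("input entropy", input_entropy),
                        ("information gain", information_gain),
                        ("empowerment", empowerment)]:
    rho_reward, _ = spearmanr(objective, task_reward)
    rho_human, _ = spearmanr(objective, human_similarity)
    print(f"{name}: corr. with reward = {rho_reward:.2f}, "
          f"with human similarity = {rho_human:.2f}")
```

Under the paper's finding, the printed correlations with human similarity would exceed those with task reward for all three objectives.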
