Paper Title

Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization

Paper Authors

Sreejith Balakrishnan, Quoc Phong Nguyen, Bryan Kian Hsiang Low, Harold Soh

Paper Abstract

The problem of inverse reinforcement learning (IRL) is relevant to a variety of tasks including value alignment and robot learning from demonstration. Despite significant algorithmic contributions in recent years, IRL remains an ill-posed problem at its core; multiple reward functions coincide with the observed behavior and the actual reward function is not identifiable without prior knowledge or supplementary information. This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL) which identifies multiple solutions that are consistent with the expert demonstrations by efficiently exploring the reward function space. BO-IRL achieves this by utilizing Bayesian Optimization along with our newly proposed kernel that (a) projects the parameters of policy invariant reward functions to a single point in a latent space and (b) ensures nearby points in the latent space correspond to reward functions yielding similar likelihoods. This projection allows the use of standard stationary kernels in the latent space to capture the correlations present across the reward function space. Empirical results on synthetic and real-world environments (model-free and model-based) show that BO-IRL discovers multiple reward functions while minimizing the number of expensive exact policy optimizations.
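The abstract describes BO-IRL's core loop at a high level: treat the likelihood of the expert demonstrations under a parameterized reward as an expensive black-box objective, model it with a Gaussian process, and choose the next reward to evaluate via an acquisition function. The sketch below is only an illustration of that loop, not the paper's implementation: it substitutes a standard RBF kernel for the paper's proposed kernel, uses a one-dimensional reward parameter, and replaces the costly policy-optimization-plus-likelihood evaluation with a hypothetical placeholder `demo_log_likelihood`.

```python
# Illustrative Bayesian-optimization loop over a 1-D reward parameter.
# NOT the paper's implementation: RBF kernel stands in for the proposed kernel,
# and demo_log_likelihood is a hypothetical stand-in for the expensive step of
# optimizing a policy under reward(theta) and scoring the expert demonstrations.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def demo_log_likelihood(theta: float) -> float:
    # Hypothetical surrogate objective with several comparable optima,
    # mimicking an IRL likelihood surface that multiple rewards explain well.
    return -np.cos(3.0 * theta) - 0.1 * theta ** 2

def expected_improvement(mu, sigma, best):
    # Standard EI acquisition for maximization over the incumbent value `best`.
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Candidate grid over the reward-parameter space and a few random initial evaluations.
candidates = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
X = rng.uniform(-3.0, 3.0, size=(3, 1))
y = np.array([demo_log_likelihood(x[0]) for x in X])

for _ in range(15):
    # Fit a GP surrogate to the (reward parameter, log-likelihood) pairs seen so far.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Evaluate the most promising candidate and add it to the training set.
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, x_next])
    y = np.append(y, demo_log_likelihood(x_next[0]))

print("Best parameter found:", X[np.argmax(y)][0], "log-likelihood:", y.max())
```

After the loop, the GP posterior over the candidate grid highlights several high-likelihood regions rather than a single point, mirroring the abstract's claim that BO-IRL recovers multiple reward functions consistent with the demonstrations while limiting the number of expensive exact policy optimizations.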
