Paper Title
A Unified Algorithm Framework for Unsupervised Discovery of Skills based on Determinantal Point Process
Authors
Abstract
Learning rich skills under the option framework without supervision of external rewards is at the frontier of reinforcement learning research. Existing works mainly fall into two distinctive categories: variational option discovery, which maximizes the diversity of the options through a mutual information loss (while ignoring coverage), and Laplacian-based methods, which focus on improving the coverage of options by increasing the connectivity of the state space (while ignoring diversity). In this paper, we show that diversity and coverage in unsupervised option discovery can indeed be unified under the same mathematical framework. To be specific, we explicitly quantify the diversity and coverage of the learned options through a novel use of the Determinantal Point Process (DPP) and optimize these objectives to discover options with both superior diversity and coverage. Our proposed algorithm, ODPP, has undergone extensive evaluation on challenging tasks created with MuJoCo and Atari. The results demonstrate that our algorithm outperforms state-of-the-art baselines in both diversity- and coverage-driven categories.
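To illustrate why a DPP is a natural tool for quantifying diversity, the following minimal sketch scores a set of feature vectors by the log-determinant of a similarity kernel, which is the unnormalized log-probability of that set under an L-ensemble DPP. This is not the ODPP objective from the paper: the RBF kernel, the `bandwidth` parameter, the function names, and the toy 2-D "trajectory features" are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(features, bandwidth=1.0):
    """Similarity matrix L with L[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).

    The RBF kernel is an illustrative assumption, not the kernel used by ODPP.
    """
    diffs = features[:, None, :] - features[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def dpp_log_score(features, bandwidth=1.0):
    """Return log det(L) for the given set (unnormalized L-ensemble DPP log-probability).

    Near-duplicate items make L nearly singular, so the score drops sharply;
    well-separated items make L close to the identity, so the score approaches 0.
    """
    kernel = rbf_kernel(features, bandwidth)
    _sign, logdet = np.linalg.slogdet(kernel)
    return logdet


# Hypothetical 2-D features, e.g. final states reached by three options.
clustered = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
spread = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])

clustered_score = dpp_log_score(clustered)
spread_score = dpp_log_score(spread)
print("clustered:", clustered_score)
print("spread:   ", spread_score)
```

The spread-out set receives the higher score, which is the property that makes a determinant-based objective reward diverse sets of options; how the paper builds its kernels over trajectories and states to also capture coverage is not shown here.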