Title
Exploring through Random Curiosity with General Value Functions
Authors
Abstract
Efficient exploration in reinforcement learning is a challenging problem commonly addressed through intrinsic rewards. Recent prominent approaches are based on state novelty or variants of artificial curiosity. However, directly applying them to partially observable environments can be ineffective and lead to premature dissipation of intrinsic rewards. Here we propose random curiosity with general value functions (RC-GVF), a novel intrinsic reward function that draws upon connections between these distinct approaches. Instead of using only the current observation's novelty or a curiosity bonus for failing to predict precise environment dynamics, RC-GVF derives intrinsic rewards by predicting temporally extended general value functions. We demonstrate that this improves exploration in a hard-exploration diabolical lock problem. Furthermore, RC-GVF significantly outperforms previous methods in the absence of ground-truth episodic counts in the partially observable MiniGrid environments. Panoramic observations on MiniGrid further boost RC-GVF's performance, making it competitive with baselines that exploit privileged information in the form of episodic counts.
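To make the core idea concrete, below is a minimal, hedged sketch of an RC-GVF-style intrinsic reward, not the paper's exact method: the cumulant definition, discounting, network architectures, and training procedure here are illustrative assumptions. A fixed random network maps observations to pseudo-reward cumulants; the general value function (GVF) target is the discounted sum of future cumulants along a trajectory; a learned predictor regresses onto that temporally extended target, and its prediction error serves as the per-step intrinsic reward.

```python
# Illustrative RC-GVF-style intrinsic reward (a sketch, not the paper's
# implementation). All dimensions, architectures, and hyperparameters
# below are assumed for the example.
import torch
import torch.nn as nn

obs_dim, cumulant_dim, gamma = 16, 8, 0.9

# Fixed random network: maps observations to pseudo-reward cumulants.
random_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, cumulant_dim)
)
for p in random_net.parameters():
    p.requires_grad_(False)

# Learned predictor of the temporally extended GVF target.
predictor = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, cumulant_dim)
)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_rewards(trajectory: torch.Tensor) -> torch.Tensor:
    """trajectory: (T, obs_dim) observations from one episode."""
    with torch.no_grad():
        cumulants = random_net(trajectory)          # (T, cumulant_dim)
        # GVF target: discounted sum of future cumulants, computed backwards.
        targets = torch.zeros_like(cumulants)
        running = torch.zeros(cumulant_dim)
        for t in reversed(range(len(trajectory))):
            running = cumulants[t] + gamma * running
            targets[t] = running
    preds = predictor(trajectory)
    errors = ((preds - targets) ** 2).mean(dim=1)   # per-step prediction error
    # Train the predictor; its error shrinks on familiar states, so the
    # intrinsic reward dissipates as regions become well explored.
    optimizer.zero_grad()
    errors.mean().backward()
    optimizer.step()
    return errors.detach()                          # intrinsic reward per step

# Usage example on a random 20-step trajectory.
rewards = intrinsic_rewards(torch.randn(20, obs_dim))
```

Because the target is a discounted sum over future observations rather than a function of the current observation alone, the prediction error reflects uncertainty about what lies ahead, which is what distinguishes this construction from purely state-novelty bonuses such as RND.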