Paper Title

On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

Paper Authors

Jinglin Chen, Aditya Modi, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal

Paper Abstract

We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.
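
For context, here is a brief sketch of the structural assumptions named in the abstract (standard formulations from the cited literature, not text from this paper; boundedness and clipping details are omitted). A low-rank MDP assumes the transition kernel factorizes through $d$-dimensional embeddings,
$$ P(s' \mid s, a) = \langle \phi^*(s, a), \mu^*(s') \rangle, $$
where both $\phi^*$ and $\mu^*$ are unknown (Modi et al., 2021). A linear MDP (Jin et al., 2020b) is the special case where the feature map $\phi$ is known and the reward is also linear, $r(s, a) = \langle \phi(s, a), \theta \rangle$. Linear completeness (Zanette et al., 2020b) instead assumes that, for a known $\phi$, the Bellman backup of any linear function remains linear: for every $\theta$ there exists a $w$ with
$$ \langle \phi(s, a), w \rangle = r(s, a) + \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \Big[ \max_{a'} \langle \phi(s', a'), \theta \rangle \Big]. $$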
