Paper Title
Reward-Free Exploration for Reinforcement Learning
Paper Authors
Paper Abstract
Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new "reward-free RL" framework. In the exploration phase, the agent first collects trajectories from an MDP $\mathcal{M}$ without a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies under $\mathcal{M}$ for a collection of given reward functions. This framework is particularly suitable when there are many reward functions of interest, or when the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration and returns $\epsilon$-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that visit each "significant" state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
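The framework separates data collection from planning: a single reward-free dataset is gathered once, and a planner is then run for each reward function supplied afterwards. Below is a minimal sketch of that protocol on a toy tabular, finite-horizon MDP, assuming uniform-random exploration in place of the paper's exploratory policies and plain value iteration on the empirical model as the black-box planner; all names (`rollout`, `plan`, the toy kernel `P`) are illustrative and not the paper's algorithm.

```python
# Minimal sketch of the reward-free RL protocol on a tabular, finite-horizon MDP.
# Assumptions (not from the paper): a toy random transition kernel, exploration
# with a uniformly random policy as a stand-in for the paper's exploratory
# policies, and value iteration on the empirical model as the planner.
import numpy as np

S, A, H = 5, 2, 10          # states, actions, horizon (illustrative sizes)
rng = np.random.default_rng(0)

# Toy MDP: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))

def rollout(policy, n_episodes):
    """Collect (s, a, s') transitions; no reward is observed during exploration."""
    data = []
    for _ in range(n_episodes):
        s = 0
        for h in range(H):
            a = policy(s, h)
            s_next = rng.choice(S, p=P[s, a])
            data.append((s, a, s_next))
            s = s_next
    return data

# Exploration phase: uniform-random actions (the paper instead constructs
# policies that reach every "significant" state with near-maximal probability).
data = rollout(lambda s, h: rng.integers(A), n_episodes=2000)

# Build the empirical transition model from the reward-free data.
counts = np.zeros((S, A, S))
for s, a, s_next in data:
    counts[s, a, s_next] += 1
P_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)

def plan(reward):
    """Value iteration on the empirical model for a reward given after exploration."""
    Q = np.zeros((H + 1, S, A))
    for h in reversed(range(H)):
        V_next = Q[h + 1].max(axis=1)        # optimal value at step h+1, shape (S,)
        Q[h] = reward + P_hat @ V_next       # Bellman backup under the empirical model
    return Q[:H].argmax(axis=2)              # greedy non-stationary policy, shape (H, S)

# Planning phase: the same dataset serves arbitrarily many reward functions.
pi_1 = plan(reward=rng.random((S, A)))
pi_2 = plan(reward=rng.random((S, A)))
```

The point illustrated is that `plan` consumes only the empirical model built from reward-free data plus a reward function revealed after exploration, so no further interaction with $\mathcal{M}$ is needed no matter how many reward functions arrive later.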