Paper Title
Instance-Dependent Confidence and Early Stopping for Reinforcement Learning
Paper Authors
Paper Abstract
Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure. Such problem-dependent behavior is not captured by worst-case analyses and has accordingly inspired a growing effort in obtaining instance-dependent guarantees and deriving instance-optimal algorithms for RL problems. This research has been carried out, however, primarily within the confines of theory, providing guarantees that explain \textit{ex post} the performance differences observed. A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice. We address the problem of obtaining sharp instance-dependent confidence regions for the policy evaluation problem and the optimal value estimation problem of an MDP, given access to an instance-optimal algorithm. As a consequence, we propose a data-dependent stopping rule for instance-optimal algorithms. The proposed stopping rule adapts to the instance-specific difficulty of the problem and allows for early termination for problems with favorable structure.
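The abstract describes the stopping rule only at a high level. Purely as an illustration of the general idea, and not the construction proposed in the paper, the sketch below runs Monte Carlo policy evaluation until a variance-adaptive (empirical-Bernstein) confidence interval certifies epsilon-accuracy; the names sample_return, epsilon, delta, and max_samples are all hypothetical.

```python
import math

def estimate_value_with_early_stopping(sample_return, epsilon, delta,
                                       max_samples=1_000_000):
    """Monte Carlo policy evaluation with a data-dependent stopping rule.

    sample_return() is a hypothetical callable that draws one i.i.d.
    return in [0, 1] from the policy being evaluated. We stop as soon as
    an empirical-Bernstein confidence interval (in the style of Audibert
    et al., 2009) certifies |estimate - true value| <= epsilon with
    probability about 1 - delta; because the interval scales with the
    empirical variance, low-variance ("easy") instances terminate earlier
    than a worst-case sample-size bound would dictate.
    """
    total, total_sq = 0.0, 0.0
    mean = 0.0
    for n in range(1, max_samples + 1):
        x = sample_return()
        total += x
        total_sq += x * x
        mean = total / n
        if n < 2:
            continue
        var = max(total_sq / n - mean * mean, 0.0)  # empirical variance
        log_term = math.log(3.0 / delta)
        # The variance term shrinks like sqrt(var / n); the 1/n correction
        # dominates only for small n. (A fully rigorous version would also
        # union-bound over the random stopping time n.)
        halfwidth = math.sqrt(2.0 * var * log_term / n) + 3.0 * log_term / n
        if halfwidth <= epsilon:  # confidence region is tight enough: stop
            return mean, n
    return mean, max_samples
```

For example, evaluating a near-deterministic policy whose returns concentrate around a single value triggers the stop after far fewer samples than a high-variance policy at the same target epsilon, which is the kind of instance-adaptive early termination the abstract refers to.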