Paper Title
Instance-Dependent Confidence and Early Stopping for Reinforcement Learning
Paper Authors
Paper Abstract
Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure. Such problem-dependent behavior is not captured by worst-case analyses and has accordingly inspired a growing effort in obtaining instance-dependent guarantees and deriving instance-optimal algorithms for RL problems. This research has been carried out, however, primarily within the confines of theory, providing guarantees that explain \textit{ex post} the performance differences observed. A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice. We address the problem of obtaining sharp instance-dependent confidence regions for the policy evaluation problem and the optimal value estimation problem of an MDP, given access to an instance-optimal algorithm. As a consequence, we propose a data-dependent stopping rule for instance-optimal algorithms. The proposed stopping rule adapts to the instance-specific difficulty of the problem and allows for early termination for problems with favorable structure.
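The abstract describes the stopping rule only at a high level. Purely as an illustration of the general idea, and not the construction proposed in the paper, the sketch below runs Monte Carlo policy evaluation until a variance-adaptive (empirical-Bernstein) confidence interval certifies epsilon-accuracy; the names sample_return, epsilon, delta, and max_samples are all hypothetical.

```python
import math

def estimate_value_with_early_stopping(sample_return, epsilon, delta,
                                       max_samples=1_000_000):
    """Monte Carlo policy evaluation with a data-dependent stopping rule.

    sample_return() is a hypothetical callable that draws one i.i.d.
    return in [0, 1] from the policy being evaluated. We stop as soon as
    an empirical-Bernstein confidence interval (in the style of Audibert
    et al., 2009) certifies |estimate - true value| <= epsilon with
    probability about 1 - delta; because the interval scales with the
    empirical variance, low-variance ("easy") instances terminate earlier
    than a worst-case sample-size bound would dictate.
    """
    total, total_sq = 0.0, 0.0
    mean = 0.0
    for n in range(1, max_samples + 1):
        x = sample_return()
        total += x
        total_sq += x * x
        mean = total / n
        if n < 2:
            continue
        var = max(total_sq / n - mean * mean, 0.0)  # empirical variance
        log_term = math.log(3.0 / delta)
        # The variance term shrinks like sqrt(var / n); the 1/n correction
        # dominates only for small n. (A fully rigorous version would also
        # union-bound over the random stopping time n.)
        halfwidth = math.sqrt(2.0 * var * log_term / n) + 3.0 * log_term / n
        if halfwidth <= epsilon:  # confidence region is tight enough: stop
            return mean, n
    return mean, max_samples
```

For example, evaluating a near-deterministic policy whose returns concentrate around a single value triggers the stop after far fewer samples than a high-variance policy at the same target epsilon, which is the kind of instance-adaptive early termination the abstract refers to.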