Paper title
Optimistic PAC Reinforcement Learning: the Instance-Dependent View
Paper authors
Paper abstract
Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new "target trick" of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.