Paper Title

Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning

Paper Authors

Georgios Kotsalis, Guanghui Lan, Tianjiao Li

Paper Abstract

The focus of this paper is on stochastic variational inequalities (VIs) under Markovian noise. A prominent application of our algorithmic developments is the stochastic policy evaluation problem in reinforcement learning. Prior investigations in the literature focused on temporal difference (TD) learning by employing nonsmooth finite-time analysis motivated by stochastic subgradient descent, leading to certain limitations. These encompass the requirement of analyzing a modified TD algorithm that involves projection onto an a priori defined Euclidean ball, a non-optimal convergence rate, and no clear way of deriving the beneficial effects of parallel implementation. Our approach remedies these shortcomings in the broader context of stochastic VIs, and in particular for stochastic policy evaluation. We develop a variety of simple TD-learning-type algorithms that are motivated by the original TD method and maintain its simplicity, while offering distinct advantages from a non-asymptotic analysis point of view. We first provide an improved analysis of the standard TD algorithm that can benefit from parallel implementation. We then present versions of a conditional TD algorithm (CTD) that involve periodic updates of the stochastic iterates, which reduce the bias and therefore exhibit improved iteration complexity. This brings us to the fast TD (FTD) algorithm, which combines elements of CTD with the stochastic operator extrapolation method of the companion paper. With a novel index-resetting policy, FTD exhibits the best-known convergence rate. We also devise a robust version of the algorithm that is particularly suitable for discount factors close to 1.
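As background, the sketch below shows the classical TD(0) update for policy evaluation under Markovian sampling, i.e., the baseline scheme that the paper's CTD and FTD methods improve upon; it is not an implementation of the paper's algorithms. The synthetic Markov reward process (P, r, gamma), the one-hot (tabular) feature map, and the step-size schedule are all illustrative assumptions.

```python
# Minimal illustrative sketch of tabular TD(0) for policy evaluation.
# NOT the paper's CTD/FTD algorithms: it only shows the basic
# temporal-difference update they build on. The Markov reward process
# and step-size schedule below are hypothetical choices for demonstration.
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic Markov reward process: n states, row-stochastic P, rewards r.
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)  # normalize rows into transition probabilities
r = rng.random(n)
gamma = 0.9  # discount factor (the paper's robust variant targets gamma near 1)

# Linear value function V(s) = phi(s)^T theta with one-hot features,
# so theta[s] is simply the value estimate of state s.
theta = np.zeros(n)

s = 0
for t in range(200_000):
    # Sample the next state along a single trajectory of the chain:
    # consecutive samples are correlated, i.e., the noise is Markovian, not i.i.d.
    s_next = rng.choice(n, p=P[s])
    # TD(0) update: move theta along the temporal-difference error.
    td_error = r[s] + gamma * theta[s_next] - theta[s]
    alpha = 1.0 / (t + 1000)  # a simple diminishing step size (illustrative)
    theta[s] += alpha * td_error
    s = s_next

# Sanity check against the exact Bellman solution V = r + gamma * P V.
V_exact = np.linalg.solve(np.eye(n) - gamma * P, r)
print("TD estimate:", np.round(theta, 3))
print("Exact value:", np.round(V_exact, 3))
```

The point the abstract emphasizes is visible in the sampling step: the states are drawn along one trajectory of the Markov chain, so the noise entering the update is Markovian rather than i.i.d., which is what complicates the finite-time analysis that the paper's algorithms address.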
