Paper Title
Stochastic Approximation with Markov Noise: Analysis and applications in reinforcement learning
Paper Authors
Paper Abstract
We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by "controlled" Markov noise. In particular, the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Using a special case of our results, we present a solution to the off-policy convergence problem for temporal-difference learning with linear function approximation. We compile several aspects of the dynamics of stochastic approximation algorithms with Markov iterate-dependent noise when the iterates are not known to be stable beforehand. We achieve this by extending the lock-in probability framework (i.e., the probability of convergence to a specific attractor of the limiting o.d.e., given that the iterates are in its domain of attraction after a sufficiently large number of iterations, say n_0) to such recursions. We use these results to prove almost sure convergence of the iterates to the specified attractor when the iterates satisfy an "asymptotic tightness" condition. This, in turn, is shown to be useful in analyzing the tracking ability of general "adaptive" algorithms. Finally, we obtain the first informative error bounds on function approximation for the policy evaluation algorithm proposed by Basu et al. when the aim is to find the risk-sensitive cost represented using exponential utility. We show that the earlier bound is uninformative due to the absence of a difference term, which is always present in all our bounds when the state space is large.
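For orientation, a minimal sketch of the kind of coupled recursion the abstract refers to, written in a standard two time-scale form (the symbols x_n, y_n, h, g, Z_n^{(i)}, M_{n+1}^{(i)}, a(n), b(n) are illustrative assumptions, not the thesis's notation):

\[
x_{n+1} = x_n + a(n)\,\big[h(x_n, y_n, Z^{(1)}_n) + M^{(1)}_{n+1}\big], \qquad
y_{n+1} = y_n + b(n)\,\big[g(x_n, y_n, Z^{(2)}_n) + M^{(2)}_{n+1}\big],
\]

where Z^{(1)}_n, Z^{(2)}_n are the controlled Markov noise processes, M^{(1)}_{n+1}, M^{(2)}_{n+1} are martingale difference sequences, and the step sizes satisfy $b(n)/a(n) \to 0$, so the x-recursion evolves on the faster time scale and the y-recursion on the slower one.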
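Purely as an illustration of the off-policy temporal-difference setting mentioned above, here is a minimal Python sketch of generic off-policy TD(0) with linear function approximation and per-step importance ratios. The MDP, feature map, policies and step sizes are hypothetical, this is not the thesis's algorithm, and plain updates of this form are not guaranteed to converge; that convergence question is exactly what the analysis addresses.

import numpy as np

# Illustrative only: generic off-policy TD(0) with linear function approximation
# and per-step importance-sampling ratios, on a small randomly generated MDP.
rng = np.random.default_rng(0)
n_states, n_actions, dim = 5, 2, 3
phi = rng.normal(size=(n_states, dim))                            # feature vectors phi(s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel P(.|s,a)
R = rng.normal(size=(n_states, n_actions))                        # rewards r(s,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # target policy pi(.|s)
mu = rng.dirichlet(np.ones(n_actions), size=n_states)             # behaviour policy mu(.|s)
gamma = 0.9

theta = np.zeros(dim)
s = 0
for n in range(1, 50001):
    a = rng.choice(n_actions, p=mu[s])                # act with the behaviour policy
    s_next = rng.choice(n_states, p=P[s, a])
    rho = pi[s, a] / mu[s, a]                         # importance-sampling ratio
    delta = R[s, a] + gamma * phi[s_next] @ theta - phi[s] @ theta  # TD error
    theta += (1.0 / n) * rho * delta * phi[s]         # diminishing step-size update
    s = s_next

print("final weights:", theta)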