Paper Title

Stochastic first-order methods for average-reward Markov decision processes

Paper Authors

Tianjiao Li, Feiyang Wu, Guanghui Lan

Paper Abstract

We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation. Compared with intensive research efforts in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions, and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward stochastic policy mirror descent (SPMD) method for solving AMDPs with and without regularizers and provide convergence guarantees in terms of the long-term average reward. For policy evaluation, existing on-policy methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies due to the lack of exploration in the action space. To remedy these issues, we develop a variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and design an exploratory VRTD method that resolves the exploration issue and provides comparable convergence guarantees. By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models. It is worth noting that when linear function approximation is utilized, our algorithm only needs to update in the low-dimensional parameter space and thus can handle MDPs with large state and action spaces.
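For context on the policy evaluation setting described above, the sketch below shows plain differential TD(0) with linear function approximation, the standard on-policy baseline that variance-reduced methods such as the VRTD method mentioned in the abstract build upon. This is not the paper's algorithm; the function names (`sample_step`, `feature_fn`) and step sizes (`alpha`, `beta`) are illustrative assumptions.

```python
# A minimal sketch (not the paper's VRTD method) of average-reward TD(0) with
# linear function approximation. All names and step sizes are illustrative.
import numpy as np

def differential_td0(sample_step, feature_fn, s0, d, num_steps=100_000,
                     alpha=0.01, beta=0.01):
    """Estimate the differential value function V(s) ~ theta^T phi(s) and the
    long-run average reward rho for a fixed randomized policy.

    sample_step(s) -> (r, s_next): one transition of the Markov chain induced
    by the policy (Markovian noise model).
    feature_fn(s) -> phi(s): a length-d feature vector for state s.
    """
    theta = np.zeros(d)   # parameters of the linear differential value function
    rho = 0.0             # running estimate of the long-run average reward
    s = s0
    for _ in range(num_steps):
        r, s_next = sample_step(s)
        phi, phi_next = feature_fn(s), feature_fn(s_next)
        # Differential TD error: reward minus the average-reward estimate,
        # plus the bootstrapped value difference.
        delta = r - rho + theta @ phi_next - theta @ phi
        theta += alpha * delta * phi      # stochastic semi-gradient update
        rho += beta * (r - rho)           # track the average reward
        s = s_next
    return theta, rho
```

As the abstract notes, the paper's VRTD method adds variance reduction on top of such stochastic TD updates to attain optimal convergence guarantees, and its exploratory variant further injects exploration in the action space so that insufficiently random policies can be evaluated; both operate only in the low-dimensional parameter space.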
