Paper Title

Efficiently Solving MDPs with Stochastic Mirror Descent

Paper Authors

Yujia Jin, Aaron Sidford

Paper Abstract

We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total state-action pairs and mixing time bound $t_{mix}$, our method computes an $\epsilon$-optimal policy using an expected $\widetilde{O}(t_{mix}^2 A_{tot} \epsilon^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $\gamma$-discounted MDP with $A_{tot}$ total state-action pairs, our method computes an $\epsilon$-optimal policy using an expected $\widetilde{O}((1-\gamma)^{-4} A_{tot} \epsilon^{-2})$ samples, matching the previous state of the art up to a $(1-\gamma)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains, and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs.
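The bilinear saddle-point framework described in the abstract is the algorithmic core: one player's iterate lives on a simplex (updated with an entropic mirror map, i.e. multiplicative weights) and the other on a box (updated with Euclidean projected steps), with gradients estimated from samples. Below is a minimal, hypothetical sketch of this primitive on a toy objective $\min_{x \in [0,1]^n} \max_{y \in \Delta_m} y^\top A x$; the function name, step sizes, and gradient estimators are illustrative assumptions, not the paper's tuned algorithm.

```python
import numpy as np

def smd_bilinear(A, T=20000, eta_x=0.01, eta_y=0.01, seed=0):
    """Stochastic mirror descent sketch for min_{x in [0,1]^n} max_{y in simplex} y^T A x.

    Simplex side: entropic mirror map (multiplicative-weights ascent).
    Box side: Euclidean gradient descent with coordinate-wise clipping.
    Gradients are unbiased single-sample estimates, mimicking access to
    a generative model rather than the full matrix A.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.full(n, 0.5)            # box iterate
    y = np.full(m, 1.0 / m)        # simplex iterate
    x_avg, y_avg = np.zeros(n), np.zeros(m)

    for _ in range(T):
        i = rng.choice(m, p=y)     # sample row i ~ y, so E[A[i, :]] = A^T y
        g_x = A[i, :]              # stochastic gradient in x
        j = rng.integers(n)        # sample a column uniformly, importance-weight it
        g_y = n * A[:, j] * x[j]   # E[g_y] = A x, stochastic gradient in y

        y = y * np.exp(eta_y * g_y)             # entropic ascent step
        y /= y.sum()                            # renormalize onto the simplex
        x = np.clip(x - eta_x * g_x, 0.0, 1.0)  # Euclidean descent step, box projection

        x_avg += x / T                          # guarantees in this setting are
        y_avg += y / T                          # stated for the averaged iterates
    return x_avg, y_avg

# Toy usage on a random payoff matrix.
A = np.random.default_rng(1).standard_normal((5, 4))
x_bar, y_bar = smd_bilinear(A)
```

Returning averaged rather than last iterates reflects the standard saddle-point analysis, where the duality gap is bounded for the average of the trajectory.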
