Temporal Difference Learning with Continuous Time and State in the Stochastic Setting

Paper title

Temporal Difference Learning with Continuous Time and State in the Stochastic Setting

Authors

Ziad Kobeissi, Francis Bach

Abstract

We consider the problem of continuous-time policy evaluation: learning, from observations, the value function associated with uncontrolled continuous-time stochastic dynamics and a reward function. We propose two original variants of the well-known TD(0) method that use vanishing time steps; one is model-free and the other is model-based. For both methods, we prove theoretical convergence rates, which we then verify through numerical simulations. These methods can also be interpreted as novel reinforcement learning approaches for approximating solutions of linear PDEs (partial differential equations) or linear BSDEs (backward stochastic differential equations).
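
To make the setting concrete, here is a minimal sketch of plain discretized TD(0) with linear features. It is not either of the two variants proposed in the paper, whose details the abstract does not give. Everything specific in it is an assumption chosen for illustration: the Ornstein-Uhlenbeck dynamics dX = -theta*X dt + sigma dW, the running reward r(x) = x^2, the discount rate rho, the feature pair (1, x^2), the constant step size alpha, and the function names td0_ou and exact_coefficients. For this toy problem the value function is known in closed form, V(x) = sigma^2 / (rho*(rho + 2*theta)) + x^2 / (rho + 2*theta), so the learned weights can be compared against it.

```python
import math
import random


def td0_ou(n_steps=600_000, dt=0.01, alpha=0.005,
           theta=1.0, sigma=1.0, rho=1.0, seed=0):
    """Linear TD(0) on an Euler-Maruyama discretization of
    dX = -theta * X dt + sigma dW with running reward r(x) = x**2.

    All parameters are illustrative assumptions, not values from the paper.
    Returns the weights (w0, w1) of V(x) ~ w0 + w1 * x**2, averaged over
    the second half of the run to reduce noise.
    """
    rng = random.Random(seed)
    gamma = math.exp(-rho * dt)        # discount factor over one time step
    w0, w1 = 0.0, 0.0                  # weights for the features (1, x**2)
    avg0, avg1, n_avg = 0.0, 0.0, 0    # running averages of the weights
    x = 0.0
    for k in range(n_steps):
        # Observe one transition of the uncontrolled dynamics.
        noise = rng.gauss(0.0, 1.0)
        x_next = x - theta * x * dt + sigma * math.sqrt(dt) * noise
        # TD error: reward accumulated over dt plus discounted next value.
        v = w0 + w1 * x * x
        v_next = w0 + w1 * x_next * x_next
        delta = x * x * dt + gamma * v_next - v
        # Semi-gradient update along the feature vector (1, x**2).
        w0 += alpha * delta
        w1 += alpha * delta * x * x
        if k >= n_steps // 2:
            n_avg += 1
            avg0 += (w0 - avg0) / n_avg
            avg1 += (w1 - avg1) / n_avg
        x = x_next
    return avg0, avg1


def exact_coefficients(theta=1.0, sigma=1.0, rho=1.0):
    """Closed-form (constant, x**2) coefficients of the continuous-time
    value function for the toy problem above."""
    return sigma ** 2 / (rho * (rho + 2.0 * theta)), 1.0 / (rho + 2.0 * theta)
```

Calling td0_ou() and exact_coefficients() with the default arguments gives two weight pairs that should be close to each other; the remaining gap combines the time-discretization bias, which shrinks as dt decreases, with the stochastic error of the iteration.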
