Paper Title

Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

Paper Authors

Liyu Chen, Haipeng Luo

Paper Abstract

We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} SAT_{\star}(\Delta_c + B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amount of changes of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK}, (B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.
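For concreteness, the dynamic-regret objective referenced above can be sketched as follows (a hedged restatement using generic SSP notation; the symbols $I_k$, $c_k$, $P_k$, and $V_k^{\star}$ are illustrative and not taken verbatim from the abstract):

$$
R_K \;=\; \sum_{k=1}^{K}\left(\sum_{i=1}^{I_k} c_k\big(s_i^k, a_i^k\big) \;-\; V_k^{\star}(s_{\mathrm{init}})\right),
$$

where $I_k$ is the number of steps the learner takes to reach the goal in episode $k$, $c_k$ and $P_k$ are that episode's cost and transition functions, and $V_k^{\star}(s_{\mathrm{init}})$ is the expected cost of the optimal policy for episode $k$ starting from the initial state. Because $c_k$ and $P_k$ may change across episodes (with total variation $\Delta_c$ and $\Delta_P$), the benchmark is a per-episode optimum rather than a single fixed policy.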
