Paper Title
Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes
Paper Authors
Paper Abstract
We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs), whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
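For reference, the weighted linear regression scheme mentioned in the abstract typically takes a variance-weighted ridge regression form such as the one sketched below. The notation here (features $\phi$, per-sample weights $\bar\sigma_{k,h}$, regularizer $\lambda$, value estimate $\hat V_{h+1}$) is generic and illustrative; the exact weight design and value targets used in the paper may differ:
$$
\hat{w}_h = \left(\sum_{k} \bar\sigma_{k,h}^{-2}\, \phi(s_h^k, a_h^k)\, \phi(s_h^k, a_h^k)^\top + \lambda I\right)^{-1} \sum_{k} \bar\sigma_{k,h}^{-2}\, \phi(s_h^k, a_h^k)\, \hat V_{h+1}(s_{h+1}^k),
$$
where $\bar\sigma_{k,h}^2$ is an estimated upper bound on the variance of the next-step value at episode $k$ and stage $h$, and $\lambda > 0$ is a ridge regularization parameter. Down-weighting high-variance samples in this way is what allows the regret bound to avoid extra factors of $H$ compared with unweighted least-squares value iteration.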