Paper Title

Optimistic MLE -- A Generic Model-based Algorithm for Partially Observable Sequential Decision Making

Paper Authors

Liu, Qinghua, Netrapalli, Praneeth, Szepesvári, Csaba, Jin, Chi

Paper Abstract

This paper introduces a simple and efficient learning algorithm for general sequential decision making. The algorithm combines Optimism for exploration with Maximum Likelihood Estimation for model estimation, and is thus named OMLE. We prove that OMLE learns near-optimal policies for an enormously rich class of sequential decision making problems in a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems (such as tabular MDPs, factored MDPs, low witness rank problems, tabular weakly-revealing/observable POMDPs, and multi-step decodable POMDPs), but also many new challenging RL problems, especially in the partially observable setting, that were not previously known to be tractable. Notably, the new problems addressed by this paper include (1) observable POMDPs with continuous observations and function approximation, where we achieve the first sample complexity that is completely independent of the size of the observation space; (2) well-conditioned low-rank sequential decision making problems (also known as Predictive State Representations (PSRs)), which include and generalize all known tractable POMDP examples under a more intrinsic representation; (3) general sequential decision making problems under the SAIL condition, which unifies our existing understanding of model-based RL in both fully observable and partially observable settings. The SAIL condition is identified in this paper and can be viewed as a natural generalization of Bellman/witness rank to address partial observability. This paper also presents a reward-free variant of the OMLE algorithm, which learns approximate dynamic models that enable the computation of near-optimal policies for all reward functions simultaneously.
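The abstract describes OMLE's core loop: maintain a confidence set of models whose log-likelihood on the trajectories collected so far is within a slack of the best-fitting model, jointly pick the model and policy inside that set that promise the highest value, and execute that policy to gather more data. Below is a minimal, self-contained sketch of this loop on a toy one-step problem; the two-element model class, the Bernoulli observations, the hardcoded per-action "value" tables, and the constant `BETA` are illustrative assumptions introduced here, not the paper's actual construction or code.

```python
# Minimal sketch of an OMLE-style loop (illustrative only, not the authors' code).
import math
import random

random.seed(0)

# Toy setup: a one-step problem with two candidate "models". Each model gives
# P(obs = 1 | action) and a promised value per action (a stand-in for the value
# obtained by planning inside that model).
ACTIONS = [0, 1]
MODEL_CLASS = [
    {"obs_prob": {0: 0.2, 1: 0.8}, "value": {0: 0.2, 1: 0.8}},  # model A
    {"obs_prob": {0: 0.8, 1: 0.2}, "value": {0: 0.8, 1: 0.2}},  # model B
]
TRUE_MODEL = MODEL_CLASS[0]
BETA = 2.0  # slack defining the width of the likelihood-based confidence set


def log_likelihood(model, data):
    """Total log-probability the model assigns to observed (action, obs) pairs."""
    ll = 0.0
    for a, o in data:
        p = model["obs_prob"][a] if o == 1 else 1.0 - model["obs_prob"][a]
        ll += math.log(max(p, 1e-12))
    return ll


def optimistic_plan(confidence_set):
    """Pick the (model, action) pair with the largest promised value."""
    return max(
        ((m, a) for m in confidence_set for a in ACTIONS),
        key=lambda pair: pair[0]["value"][pair[1]],
    )


data = []
for k in range(50):
    # Step 1: confidence set = all models nearly as likely as the MLE fit.
    best_ll = max(log_likelihood(m, data) for m in MODEL_CLASS)
    conf_set = [m for m in MODEL_CLASS if log_likelihood(m, data) >= best_ll - BETA]
    # Step 2: optimistic planning inside the confidence set.
    model_k, action_k = optimistic_plan(conf_set)
    # Step 3: execute the chosen action and record the resulting observation.
    obs = 1 if random.random() < TRUE_MODEL["obs_prob"][action_k] else 0
    data.append((action_k, obs))

print("final optimistic model:", model_k, "action:", action_k)
```

In the actual algorithm, the planning step optimizes over full policies and the value is the optimal value under each candidate model; the sketch collapses both into lookup tables to keep the loop structure visible.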
