Interval Dominance based Structural Results for Markov Decision Process

Paper Title

Interval Dominance based Structural Results for Markov Decision Process

Author

Krishnamurthy, Vikram

Abstract

Structural results impose sufficient conditions on the model parameters of a Markov decision process (MDP) so that the optimal policy is an increasing function of the underlying state. The classical assumptions for MDP structural results require supermodularity of the rewards and transition probabilities. However, supermodularity does not hold in many applications. This paper uses a sufficient condition for interval dominance (called I) proposed in the microeconomics literature, to obtain structural results for MDPs under more general conditions. We present several MDP examples where supermodularity does not hold, yet I holds, and so the optimal policy is monotone; these include sigmoidal rewards (arising in prospect theory for human decision making), bi-diagonal and perturbed bi-diagonal transition matrices (in optimal allocation problems). We also consider MDPs with TP3 transition matrices and concave value functions. Finally, reinforcement learning algorithms that exploit the differential sparse structure of the optimal monotone policy are discussed.
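To make the classical assumption mentioned in the abstract concrete, the following is a minimal sketch of how supermodularity (increasing differences) of a reward table could be checked on a finite state and action grid, together with a check that a policy is monotone in the state. The function names `is_supermodular` and `is_monotone_policy` and the toy reward tables are illustrative assumptions, not taken from the paper. The sketch covers rewards only, and it does not implement the interval dominance condition (I), which the abstract does not define.

```python
def is_supermodular(r, tol=1e-12):
    """Check increasing differences of a reward table r[x][a].

    r is a list of rows, one row per state x, one column per action a.
    Supermodularity on a finite grid means
        r[x+1][a+1] - r[x+1][a] >= r[x][a+1] - r[x][a]
    for every adjacent pair of states and actions (adjacent pairs suffice,
    since the general inequality follows by summing adjacent ones).
    """
    n_states = len(r)
    n_actions = len(r[0])
    for x in range(n_states - 1):
        for a in range(n_actions - 1):
            gain_low_state = r[x][a + 1] - r[x][a]
            gain_high_state = r[x + 1][a + 1] - r[x + 1][a]
            if gain_high_state < gain_low_state - tol:
                return False
    return True


def is_monotone_policy(policy):
    """Check that policy[x] (an action index per state) is nondecreasing in x."""
    return all(policy[x] <= policy[x + 1] for x in range(len(policy) - 1))


# Hypothetical toy reward tables with 4 states and 3 actions.
r_super = [[x * a for a in range(3)] for x in range(4)]       # r(x, a) = x * a
r_not_super = [[-x * a for a in range(3)] for x in range(4)]  # r(x, a) = -x * a

print(is_supermodular(r_super))
print(is_supermodular(r_not_super))
print(is_monotone_policy([0, 0, 1, 2]))
```

A check like this only tests the classical sufficient condition on the rewards. The abstract's point is that a monotone optimal policy can still exist when such a check fails, provided the weaker condition (I) holds.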
