A unified algorithm framework for mean-variance optimization in discounted Markov decision processes

Paper title

A unified algorithm framework for mean-variance optimization in discounted Markov decision processes

Authors

Ma, Shuai; Ma, Xiaoteng; Xia, Li

Abstract

This paper studies the risk-averse mean-variance optimization in infinite-horizon discounted Markov decision processes (MDPs). The involved variance metric concerns reward variability during the whole process, and future deviations are discounted to their present values. This discounted mean-variance optimization yields a reward function dependent on a discounted mean, and this dependency renders traditional dynamic programming methods inapplicable since it suppresses a crucial property -- time consistency. To deal with this unorthodox problem, we introduce a pseudo mean to transform the untreatable MDP to a standard one with a redefined reward function in standard form and derive a discounted mean-variance performance difference formula. With the pseudo mean, we propose a unified algorithm framework with a bilevel optimization structure for the discounted mean-variance optimization. The framework unifies a variety of algorithms for several variance-related problems including, but not limited to, risk-averse variance and mean-variance optimizations in discounted and average MDPs. Furthermore, the convergence analyses missing from the literature can be complemented with the proposed framework as well. Taking the value iteration as an example, we develop a discounted mean-variance value iteration algorithm and prove its convergence to a local optimum with the aid of a Bellman local-optimality equation. Finally, we conduct a numerical experiment on portfolio management to validate the proposed algorithm.
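
The pseudo-mean idea in the abstract can be illustrated with a short derivation. This is only a sketch under assumed notation, not the paper's own formulation: $r_t$ is the reward at time $t$, $\gamma \in (0,1)$ the discount factor, $\beta \ge 0$ a risk-aversion weight, $\eta^{\pi}$ the (normalized) discounted mean under policy $\pi$, and $y$ the pseudo mean. All of these symbols, and the normalization by $(1-\gamma)$, are assumptions made for illustration.

```latex
% Assumed form of the discounted mean-variance objective:
\max_{\pi}\; \mathbb{E}^{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t}
    \big( r_t - \beta\,(r_t - \eta^{\pi})^{2} \big)\Big],
\qquad
\eta^{\pi} = (1-\gamma)\,\mathbb{E}^{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t} r_t\Big].

% The per-step reward depends on \eta^{\pi}, hence on the policy itself.
% Replacing \eta^{\pi} by a fixed constant y gives a standard discounted MDP
% with the redefined reward  r - \beta (r - y)^2.

% The quadratic in y is minimized exactly at y = \eta^{\pi}:
\frac{\partial}{\partial y}\,
\mathbb{E}^{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t} (r_t - y)^{2}\Big]
 = -2\,\mathbb{E}^{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t} r_t\Big] + \frac{2y}{1-\gamma} = 0
\;\Longrightarrow\; y = \eta^{\pi}.

% Hence the original problem can be written as a bilevel optimization:
\max_{y}\;\max_{\pi}\;
\mathbb{E}^{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t}
    \big( r_t - \beta\,(r_t - y)^{2} \big)\Big].
```

Under these assumptions, the inner maximization over $\pi$ for a fixed $y$ is an ordinary discounted MDP that dynamic programming can handle, while the outer level updates the scalar $y$. This matches the bilevel structure the abstract describes, though the paper's exact definitions may differ.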
