Paper Title

Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

Authors

Tomáš Brázdil, Krishnendu Chatterjee, Petr Novotný, Jiří Vahala

Abstract

Markov decision processes (MDPs) are the de facto framework for sequential decision making in the presence of stochastic uncertainty. A classical optimization criterion for MDPs is to maximize the expected discounted-sum payoff, which ignores low-probability catastrophic events with highly negative impact on the system. On the other hand, risk-averse policies require the probability of undesirable events to be below a given threshold, but they do not account for optimization of the expected payoff. We consider MDPs with discounted-sum payoff with failure states which represent catastrophic outcomes. The objective of risk-constrained planning is to maximize the expected discounted-sum payoff among risk-averse policies that ensure the probability to encounter a failure state is below a desired threshold. Our main contribution is an efficient risk-constrained planning algorithm that combines UCT-like search with a predictor learned through interaction with the MDP (in the style of AlphaZero) and with a risk-constrained action selection via linear programming. We demonstrate the effectiveness of our approach with experiments on classical MDPs from the literature, including benchmarks with an order of 10^6 states.
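
To make the risk-constrained action selection step mentioned in the abstract more concrete, the following is a minimal Python sketch, not the paper's actual implementation: given per-action payoff estimates and per-action failure-probability (risk) estimates at a search node, it solves a small linear program that picks a distribution over actions maximizing expected payoff while keeping expected risk below a threshold delta. The function name, the per-action estimates, and the infeasibility fallback are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

def risk_constrained_action_distribution(payoffs, risks, delta):
    """Maximize p . payoffs subject to p . risks <= delta, sum(p) = 1, p >= 0."""
    payoffs = np.asarray(payoffs, dtype=float)
    risks = np.asarray(risks, dtype=float)
    n = len(payoffs)
    res = linprog(
        c=-payoffs,                  # linprog minimizes, so negate the payoff objective
        A_ub=risks.reshape(1, n),    # expected failure probability must stay below delta
        b_ub=[delta],
        A_eq=np.ones((1, n)),        # action weights form a probability distribution
        b_eq=[1.0],
        bounds=[(0.0, 1.0)] * n,
        method="highs",
    )
    if not res.success:
        # Risk bound infeasible at this node (illustrative fallback): take the safest action.
        fallback = np.zeros(n)
        fallback[np.argmin(risks)] = 1.0
        return fallback
    return res.x

# Example: three candidate actions with estimated payoffs and failure probabilities.
dist = risk_constrained_action_distribution(
    payoffs=[1.0, 0.6, 0.2], risks=[0.30, 0.10, 0.01], delta=0.10)
print(dist)  # puts weight on actions that keep expected risk within the 10% bound

In the full algorithm described by the abstract, the payoff and risk estimates would come from the UCT-like search statistics and the AlphaZero-style learned predictor; this sketch only isolates the linear-programming step under those assumptions.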
