Title
STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence
Authors
Abstract
It remains challenging to deploy existing risk-averse approaches to real-world applications. The reasons are manifold, including the lack of a global-optimality guarantee and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to involving visits to hazardous states, which is a major concern in the risk-averse setting. This paper proposes Short-Term VOlatility-controlled Policy Search (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term rather than long-term trajectories. Short-term trajectories are more flexible to generate and can avoid the danger of hazardous-state visitations. Using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate under both proximal policy optimization and natural policy gradient, matching the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate the state-of-the-art performance of STOPS among existing risk-averse policy-search methods.
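To make the mean-variance evaluation metric concrete, here is a minimal sketch of the standard mean-variance objective J = E[R] - λ·Var(R) computed over sampled episode returns. The function name and the risk-aversion coefficient `lam` are illustrative assumptions; the paper's exact formulation and weighting may differ.

```python
import numpy as np

def mean_variance_objective(returns, lam=0.5):
    """Mean-variance objective: J = E[R] - lam * Var(R).

    `lam` is a hypothetical risk-aversion coefficient (assumption,
    not taken from the paper); larger values penalize volatility more.
    """
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - lam * returns.var()

# A risk-averse learner prefers the steadier return stream even
# though both streams have the same mean return:
steady = [10.0, 10.0, 10.0, 10.0]
volatile = [0.0, 20.0, 0.0, 20.0]
print(mean_variance_objective(steady))    # 10.0 (mean 10, variance 0)
print(mean_variance_objective(volatile))  # -40.0 (mean 10, variance 100)
```

Under this metric a policy is rewarded not only for high expected return but also for low return volatility, which is why risk-averse policy search penalizes visits to hazardous, high-variance states.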