Paper Title
Wall Street Tree Search: Risk-Aware Planning for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement-learning (RL) algorithms learn to make decisions from a given, fixed training dataset without online data collection. This problem setting is captivating because it holds the promise of utilizing previously collected datasets without any costly or risky interaction with the environment. However, this promise is also the setting's drawback: the restricted dataset induces uncertainty, since the agent can encounter unfamiliar sequences of states and actions that the training data does not cover. To mitigate the destructive effects of this uncertainty, we need to balance the aspiration to take reward-maximizing actions against the risk incurred by incorrect ones. In financial economics, modern portfolio theory (MPT) is a method that risk-averse investors can use to construct diversified portfolios that maximize their returns without taking on unacceptable levels of risk. We propose integrating MPT into the agent's decision-making process and present a new, simple yet highly effective risk-aware planning algorithm for offline RL. Our algorithm allows us to systematically account for the \emph{estimated quality} of specific actions and their \emph{estimated risk} due to uncertainty. We show that our approach can be coupled with the Transformer architecture to yield a state-of-the-art planner that maximizes the return on offline RL tasks. Moreover, our algorithm significantly reduces the variance of the results compared to conventional Transformer decoding, yielding a much more stable algorithm -- a property that is essential in the offline RL setting, where real-world exploration and failures can be costly or dangerous.
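As a rough illustration only (not the paper's exact formulation), the MPT-style trade-off between estimated quality and estimated risk described above can be written as a mean-variance action-selection criterion, where $\hat{\mu}(s,a)$ denotes an estimated return for taking action $a$ in state $s$, $\hat{\sigma}(s,a)$ its estimated uncertainty, and $\lambda \ge 0$ an assumed risk-aversion weight (all three symbols are illustrative, not notation from the paper):
\[
a^{*} = \arg\max_{a} \; \hat{\mu}(s, a) - \lambda \, \hat{\sigma}(s, a).
\]
Larger values of $\lambda$ correspond to a more risk-averse planner that prefers actions whose outcomes are better supported by the offline dataset.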