Paper Title
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
Paper Authors
Paper Abstract
We study regret guarantees for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of the return. By leveraging a key property of the EntRM, the independence property, we establish a risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes: a model-free one and a model-based one. We prove that both attain an $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK})$ regret upper bound, where $S$, $A$, $K$, and $H$ denote the number of states, the number of actions, the number of episodes, and the time horizon, respectively. This matches the bound of RSVI2 proposed in \cite{fei2021exponential}, with a novel distributional analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity. Acknowledging the computational inefficiency of the model-free DRL algorithm, we propose an alternative DRL algorithm with a distribution representation. This approach not only maintains the established regret bounds but also significantly improves computational efficiency. We also prove a tighter minimax lower bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
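Background on the objective: for a random return $X$ and risk parameter $\beta \neq 0$, the entropic risk measure is standardly defined as $U_\beta(X) = \frac{1}{\beta}\log\mathbb{E}[\exp(\beta X)]$, which recovers the risk-neutral objective $\mathbb{E}[X]$ in the limit $\beta \to 0$. The independence property referenced above is the additivity of EntRM over independent summands: if $X$ and $Y$ are independent, then $U_\beta(X+Y) = U_\beta(X) + U_\beta(Y)$, since $\mathbb{E}[e^{\beta(X+Y)}] = \mathbb{E}[e^{\beta X}]\,\mathbb{E}[e^{\beta Y}]$.

A minimal numerical sketch of this standard definition (an illustration only, not code from the paper; the function name and the log-mean-exp stabilization are our own choices):

    import numpy as np

    def entrm(returns, beta):
        """Empirical EntRM: U_beta(X) = (1/beta) * log E[exp(beta * X)], for beta != 0."""
        m = beta * np.asarray(returns, dtype=float)
        c = m.max()  # max-shift for numerical stability (log-mean-exp trick)
        return (c + np.log(np.mean(np.exp(m - c)))) / beta

As a sanity check, for Gaussian returns $X \sim \mathcal{N}(\mu, \sigma^2)$ the closed form is $U_\beta(X) = \mu + \beta\sigma^2/2$, so the empirical estimate should approach this value as the sample size grows.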