Paper Title
Distributional Reinforcement Learning via Moment Matching
Paper Authors
Paper Abstract
We consider the problem of learning a set of probability distributions from the empirical Bellman dynamics in distributional reinforcement learning (RL), a class of state-of-the-art methods that estimate the distribution, as opposed to only the expectation, of the total return. We formulate a method that learns a finite set of statistics from each return distribution via neural networks, as in (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b). However, existing distributional RL methods constrain the learned statistics to \emph{predefined} functional forms of the return distribution, which is both restrictive in representation and makes the predefined statistics difficult to maintain. Instead, we learn \emph{unrestricted} statistics, i.e., deterministic (pseudo-)samples, of the return distribution by leveraging a technique from hypothesis testing known as maximum mean discrepancy (MMD), which leads to a simpler objective amenable to backpropagation. Our method can be interpreted as implicitly matching all orders of moments between a return distribution and its Bellman target. We establish sufficient conditions for the contraction of the distributional Bellman operator and provide a finite-sample analysis of the deterministic samples in distribution approximation. Experiments on the suite of Atari games show that our method outperforms the standard distributional RL baselines and sets a new record in the Atari games for non-distributed agents.
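For concreteness, the squared MMD between a predicted return distribution $\mu$ and a target distribution $\nu$ under a kernel $k$ is the standard quantity
\[ \mathrm{MMD}^2(\mu, \nu; k) = \mathbb{E}\,[k(X, X')] + \mathbb{E}\,[k(Y, Y')] - 2\,\mathbb{E}\,[k(X, Y)], \qquad X, X' \sim \mu,\; Y, Y' \sim \nu . \]
The sketch below is a minimal, illustrative NumPy version of the resulting sample-based loss between deterministic pseudo-samples and their distributional Bellman targets; the Gaussian kernel, bandwidth, and variable names are assumptions for illustration, not the paper's implementation.

import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 * bandwidth^2)).
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))

def squared_mmd(x, y, bandwidth=1.0):
    # Biased empirical estimate of MMD^2 between two sets of 1-D samples.
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

# Hypothetical example: deterministic pseudo-samples of Z(s, a) versus the
# distributional Bellman target r + gamma * Z(s', a') built from next-state samples.
pred = np.array([0.1, 0.5, 0.9, 1.3])            # pseudo-samples of Z(s, a)
next_samples = np.array([0.0, 0.4, 0.8, 1.2])    # pseudo-samples of Z(s', a')
reward, gamma = 1.0, 0.99
target = reward + gamma * next_samples
print(squared_mmd(pred, target, bandwidth=1.0))  # scalar loss, amenable to backpropagation

With a characteristic kernel such as the Gaussian, driving this quantity to zero forces the two sample sets to agree as distributions, which is one way to read the abstract's claim that the objective implicitly matches all orders of moments between a return distribution and its Bellman target.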