Paper Title
An Investigation of the Bias-Variance Tradeoff in Meta-Gradients
Paper Authors
Paper Abstract
Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. In this context, prior work has investigated the estimation of the Hessian of the RL objective and has tackled the problem of credit assignment to pre-adaptation behavior by applying a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always adds bias and can also add variance to meta-gradient estimation. Meanwhile, meta-gradient estimation has been studied less in the important long-horizon setting, where backpropagation through the full inner optimization trajectories is not feasible. We study the bias-variance tradeoff arising from truncated backpropagation and sampling correction, and additionally compare against evolution strategies, a recently popular alternative for long-horizon meta-learning. While prior work implicitly chooses points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study that relates existing estimators to one another.
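To make the two estimator families mentioned in the abstract concrete, the following is a minimal sketch, not taken from the paper: it meta-differentiates an inner-loop learning rate on a toy quadratic objective (standing in for the RL objective), comparing (i) backpropagation through a truncated unroll of the inner optimization against (ii) an evolution-strategies estimate. All function names, the choice of meta-parameter, and the constants (truncation lengths, sigma, sample count) are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch (assumptions noted above): meta-gradient of an inner-loop
# learning rate, via truncated backprop vs. evolution strategies, in JAX.
import jax
import jax.numpy as jnp

def inner_loss(theta):
    # Toy inner objective standing in for the RL objective.
    return jnp.sum((theta - 3.0) ** 2)

def unroll(log_lr, theta0, steps):
    # Inner optimization: plain gradient descent whose step size is the
    # meta-parameter; the meta-objective is the final inner loss.
    lr = jnp.exp(log_lr)
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * jax.grad(inner_loss)(theta)
    return inner_loss(theta)

theta0 = jnp.zeros(5)
log_lr = jnp.log(0.05)

# (i) Truncated backpropagation: differentiate through only K inner steps.
# Shorter truncations are cheaper and lower-variance but bias the estimate.
for K in (1, 5, 20):
    g = jax.grad(unroll)(log_lr, theta0, K)
    print(f"truncated backprop, K={K:2d}: meta-grad = {float(g):+.4f}")

# (ii) Evolution strategies: unbiased for a Gaussian-smoothed meta-objective,
# typically higher variance, and needs no backprop through the unroll.
key = jax.random.PRNGKey(0)
sigma, n_pairs, K = 0.1, 256, 20
eps = jax.random.normal(key, (n_pairs,))
f_plus = jax.vmap(lambda e: unroll(log_lr + sigma * e, theta0, K))(eps)
f_minus = jax.vmap(lambda e: unroll(log_lr - sigma * e, theta0, K))(eps)
es_grad = jnp.mean((f_plus - f_minus) / (2 * sigma) * eps)  # antithetic ES
print(f"evolution strategies, K={K}: meta-grad ~ {float(es_grad):+.4f}")
```

In this toy setting, varying K and sigma exposes the same tradeoff the abstract studies: truncation trades bias for compute and variance, while evolution strategies remove the need to differentiate through the unroll at the cost of sampling noise.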