Paper Title
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
Paper Authors
Paper Abstract
Normalization techniques are a boon for modern deep learning. They let weights converge more quickly, often with better generalization performance. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at https://github.com/clovaai/AdamP.
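To make the remedy concrete, below is a minimal NumPy sketch of the idea described in the abstract: before applying a momentum-based update to a scale-invariant weight, project out the radial (norm-increasing) component so that only the direction-changing part remains. The function names (`project_out_radial`, `sgdp_step`) and the plain SGD-with-momentum wrapper are illustrative assumptions for exposition, not the released API; the official PyTorch implementation lives in the linked repository.

```python
# Illustrative sketch, not the official AdamP/SGDP code.
import numpy as np

def project_out_radial(w, update, eps=1e-8):
    """Project `update` onto the plane orthogonal to `w`.

    For scale-invariant weights, the radial component
    w * <w, update> / ||w||^2 only inflates the weight norm and thereby
    prematurely shrinks the effective step size, so it is discarded.
    """
    w_flat, u_flat = w.ravel(), update.ravel()
    radial = w_flat * (w_flat @ u_flat) / (w_flat @ w_flat + eps)
    return (u_flat - radial).reshape(w.shape)

def sgdp_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step whose update is projected as above."""
    velocity = momentum * velocity + grad      # standard momentum buffer
    update = project_out_radial(w, velocity)   # drop the norm-increasing part
    return w - lr * update, velocity

# Toy usage: a single step on a randomly initialized weight tensor.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
v = np.zeros_like(w)
w, v = sgdp_step(w, rng.standard_normal(w.shape), v)
```

Because a scale-invariant weight only affects the network through its direction, removing the component parallel to `w` changes the effective step size but not the effective update direction, which is the property the abstract relies on to argue that the original convergence behaviour of the underlying optimizer is preserved.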