Paper title
The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon
Paper authors
Paper abstract
The grokking phenomenon reported by Power et al. (arXiv:2201.02177) refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layer's weights. We empirically observe that, without explicit regularization, grokking as reported in (arXiv:2201.02177) almost exclusively happens at the onset of Slingshots and is absent without them. While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any known optimization theory that we are aware of, and can be easily overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of their origin.
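The abstract notes that the Slingshot Mechanism can be monitored through the norm of the last layer's weights. The following is a minimal sketch of that bookkeeping, assuming the last-layer weights are saved as nested Python lists at each logged training step. The function names `frobenius_norm` and `norm_history` are hypothetical and do not come from the paper, and the choice of the Frobenius norm is an assumption.

```python
import math


def frobenius_norm(weights):
    """Return the Frobenius norm of a 2-D weight matrix given as nested lists.

    Assumption: the norm being tracked is the plain Frobenius (L2) norm.
    """
    return math.sqrt(sum(w * w for row in weights for w in row))


def norm_history(snapshots):
    """Return one norm per snapshot, in the order the snapshots were logged.

    `snapshots` is a hypothetical list of last-layer weight matrices,
    one per logged training step.
    """
    return [frobenius_norm(weights) for weights in snapshots]


if __name__ == "__main__":
    # Two toy snapshots of a 1x2 last layer, for illustration only.
    snapshots = [[[3.0, 4.0]], [[6.0, 8.0]]]
    print(norm_history(snapshots))
```

Plotting the returned list against the training step would be one way to look for the cyclic behavior the abstract describes; any criterion for flagging a cycle would be a further assumption and is left out here.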