Paper Title

Triple descent and the two kinds of overfitting: Where & why do they appear?

Authors

Stéphane d'Ascoli, Levent Sagun, Giulio Biroli

Abstract

A recent line of research has highlighted the existence of a "double descent" phenomenon in deep learning, whereby increasing the number of training examples $N$ causes the generalization error of neural networks to peak when $N$ is of the same order as the number of parameters $P$. In earlier works, a similar phenomenon was shown to exist in simpler models such as linear regression, where the peak instead occurs when $N$ is equal to the input dimension $D$. Since both peaks coincide with the interpolation threshold, they are often conflated in the literature. In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when neural networks are applied to noisy regression tasks. The relative size of the peaks is then governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent. As shown previously, the nonlinear peak at $N\!=\!P$ is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in neural networks). This peak survives in the absence of noise, but can be suppressed by regularization. In contrast, the linear peak at $N\!=\!D$ is solely due to overfitting the noise in the labels, and forms earlier during training. We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise and is weakly affected by explicit regularization. Throughout the paper, we compare analytical results obtained in the random feature model with the outcomes of numerical experiments involving deep neural networks.
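The sample-wise setup described in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' code: it fits a random feature model (frozen random first-layer weights, a tanh nonlinearity, ridge regression on the features) to a noisy linear teacher and records the test error as the number of training samples $N$ sweeps past both $N = D$ and $N = P$. All sizes, the noise level, the activation, and the ridge penalty are illustrative choices, not values taken from the paper; with high label noise and near-vanishing regularization one expects an error bump near $N = D$ (the linear peak) and a much larger one near $N = P$ (the nonlinear peak).

```python
# Minimal sketch of a sample-wise triple descent experiment with a random
# feature model. Parameters below are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

D, P = 100, 300          # input dimension and number of random features
noise = 1.0              # std of the label noise (high noise makes the N = D peak visible)
ridge = 1e-6             # near-vanishing explicit regularization
teacher = rng.standard_normal(D) / np.sqrt(D)   # noisy linear teacher
F = rng.standard_normal((P, D))                 # frozen random first-layer weights

def features(X):
    # Random feature map with a nonlinear activation (tanh as an example).
    return np.tanh(X @ F.T / np.sqrt(D))

def test_error(N, n_test=2000):
    # Train ridge regression on N noisy samples, evaluate on noiseless test data.
    X = rng.standard_normal((N, D))
    y = X @ teacher + noise * rng.standard_normal(N)
    Z = features(X)
    w = np.linalg.solve(Z.T @ Z + ridge * np.eye(P), Z.T @ y)
    Xt = rng.standard_normal((n_test, D))
    yt = Xt @ teacher
    return np.mean((features(Xt) @ w - yt) ** 2)

# Sweep the number of samples past N = D = 100 and N = P = 300.
for N in [20, 50, 100, 150, 200, 300, 400, 800]:
    print(f"N = {N:4d}  test MSE = {test_error(N):.3f}")
```

In this sketch the degree of nonlinearity can be probed by swapping the activation (e.g. replacing tanh with a more linear function), and the effect of explicit regularization by increasing the ridge penalty, mirroring the comparisons discussed in the abstract.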
