Paper Title

The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training

Authors

Andrea Montanari, Yiqiao Zhong

Abstract

Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular, on $\log n/\log d$).
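
To make the objects in the abstract concrete, here is a minimal numerical sketch (not the authors' code): it builds the NT feature map of a two-layer ReLU network linearized in its first-layer weights, forms the empirical NT kernel and checks its minimum eigenvalue, runs NT ridge regression (whose ridgeless limit is the min-$\ell_2$ norm interpolator), and compares the empirical kernel with a closed-form infinite-width NT kernel. The ReLU activation, the $1/\sqrt{N}$ scaling, and all function names are illustrative assumptions rather than specifications taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' code) of the quantities
# discussed in the abstract. Assumes a two-layer ReLU network linearized
# in its first-layer weights, Gaussian weights, and isotropic covariates.
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 200, 20, 400                         # samples, dimension, hidden neurons (Nd >> n)

X = rng.standard_normal((n, d)) / np.sqrt(d)   # isotropic covariate vectors
y = rng.standard_normal(n)                     # arbitrary (random) labels

W = rng.standard_normal((N, d))                # random first-layer weights

def nt_features(X, W):
    """NT feature map: block i of Phi(x) is sigma'(<w_i, x>) * x / sqrt(N) (ReLU)."""
    act = (X @ W.T > 0).astype(float)          # sigma'(<w_i, x>) for ReLU
    Phi = act[:, :, None] * X[:, None, :] / np.sqrt(W.shape[0])
    return Phi.reshape(X.shape[0], -1)         # shape (n, N*d)

Phi = nt_features(X, W)
K_emp = Phi @ Phi.T                            # empirical NT kernel (n x n)

# First result in the abstract: once Nd >> n, the minimum eigenvalue of K_emp
# stays bounded away from zero, so arbitrary labels can be interpolated exactly.
print("lambda_min(K_emp) =", np.linalg.eigvalsh(K_emp)[0])

def nt_ridge_predict(Phi_tr, y_tr, Phi_te, ridge):
    """Kernel form of NT ridge regression; ridge -> 0 gives min-l2-norm interpolation."""
    K = Phi_tr @ Phi_tr.T
    alpha = np.linalg.solve(K + ridge * np.eye(len(y_tr)), y_tr)
    return Phi_te @ Phi_tr.T @ alpha

def nt_kernel_infinite(X1, X2):
    """Closed-form infinite-width NT kernel for ReLU (first-layer linearization):
    K(x1, x2) = <x1, x2> * (pi - arccos(rho)) / (2*pi), rho = cosine similarity."""
    G = X1 @ X2.T
    nrm1 = np.linalg.norm(X1, axis=1, keepdims=True)
    nrm2 = np.linalg.norm(X2, axis=1, keepdims=True)
    rho = np.clip(G / (nrm1 * nrm2.T), -1.0, 1.0)
    return G * (np.pi - np.arccos(rho)) / (2 * np.pi)

# Second result (informally): NT ridge regression behaves like kernel ridge
# regression with the infinite-width kernel; here we only check that the
# empirical and infinite-width kernel matrices are close at this width.
K_inf = nt_kernel_infinite(X, X)
print("relative kernel gap:", np.linalg.norm(K_emp - K_inf) / np.linalg.norm(K_inf))

X_test = rng.standard_normal((50, d)) / np.sqrt(d)
y_hat = nt_ridge_predict(Phi, y, nt_features(X_test, W), ridge=1e-8)  # near-interpolating fit
```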
