Paper Title
On Generalization of Adaptive Methods for Over-parameterized Linear Regression
Paper Authors
Paper Abstract
Over-parameterization and adaptive methods have played a crucial role in the success of deep learning over the last decade. The widespread use of over-parameterization has forced us to rethink generalization by bringing forth new phenomena, such as the implicit regularization of optimization algorithms and double descent as training progresses. A series of recent works has started to shed light on these areas in the quest to understand why neural networks generalize well. The setting of over-parameterized linear regression has provided key insights into this mysterious behavior of neural networks. In this paper, we aim to characterize the performance of adaptive methods in the over-parameterized linear regression setting. We divide adaptive methods into two sub-classes according to their generalization performance. For the first class, the parameter vector remains in the span of the data and converges to the minimum-norm solution, as gradient descent (GD) does. For the second class, the gradient rotation caused by the pre-conditioner matrix yields an in-span component of the parameter vector that converges to the minimum-norm solution and an out-of-span component that saturates. Our experiments on over-parameterized linear regression and deep neural networks support this theory.
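The contrast described in the abstract can be reproduced numerically. Below is a minimal sketch (not the authors' code) on a toy over-parameterized least-squares problem: plain GD, whose iterates stay in the row span of the data, recovers the minimum-norm interpolating solution, whereas pre-conditioned GD with a hypothetical dense positive-definite matrix P picks up an out-of-span component that persists at convergence. The data, step sizes, and the specific choice of P are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                                  # over-parameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolating solution: w* = X^T (X X^T)^{-1} y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

def run(P, steps=50_000):
    """Pre-conditioned GD, w <- w - lr * P * grad, on the loss 0.5 * ||X w - y||^2."""
    # Conservative step size ensuring stability: lr < 2 / lambda_max(P X^T X).
    lr = 1.0 / (np.linalg.norm(P, 2) * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        w -= lr * P @ grad
    return w

# Class 1: identity pre-conditioner (plain GD). Iterates stay in span(X^T),
# so the limit is the minimum-norm solution.
w_gd = run(np.eye(d))

# Class 2: a hypothetical dense positive-definite pre-conditioner rotates the
# gradient out of the data span, leaving a saturating out-of-span component.
A = rng.standard_normal((d, d))
P = A @ A.T + 0.1 * np.eye(d)
w_pre = run(P)

# Projector onto the row space of X, used to measure the out-of-span component.
P_span = X.T @ np.linalg.solve(X @ X.T, X)
print("GD:  dist to min-norm =", np.linalg.norm(w_gd - w_min_norm),
      " out-of-span norm =", np.linalg.norm((np.eye(d) - P_span) @ w_gd))
print("Pre: dist to min-norm =", np.linalg.norm(w_pre - w_min_norm),
      " out-of-span norm =", np.linalg.norm((np.eye(d) - P_span) @ w_pre))
```

Under this setup, both runs interpolate the training data, but only the GD solution coincides with the minimum-norm solution; the pre-conditioned run converges to a different interpolator with a nonzero out-of-span component, matching the dichotomy the paper studies.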