Paper Title
A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks
Paper Authors
Paper Abstract
When equipped with efficient optimization algorithms, over-parameterized neural networks demonstrate a high level of performance even though the loss function is non-convex and non-smooth. While many works have focused on understanding the loss dynamics of neural networks trained with gradient descent (GD), in this work we consider a broad class of optimization algorithms that are commonly used in practice. For example, we show from a dynamical-systems perspective that the Heavy Ball (HB) method can converge to the global minimum of the mean squared error (MSE) at a linear rate (similar to GD), whereas Nesterov's accelerated gradient descent (NAG) may only converge to the global minimum sublinearly. Our results rely on the connection between the neural tangent kernel (NTK) and finite over-parameterized neural networks with ReLU activation, which leads to analyzing the limiting ordinary differential equations (ODEs) of the optimization algorithms. We show that optimizing the non-convex loss over the weights corresponds to optimizing a strongly convex loss over the prediction error. As a consequence, we can leverage classical convex optimization theory to understand the convergence behavior of neural networks. We believe our approach can also be extended to other optimization algorithms and network architectures.
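To make the mechanism in the abstract concrete, the following is a minimal sketch of the kind of limiting ODEs such an analysis typically studies, written in standard NTK-regime notation rather than the paper's own: u(t) denotes the vector of network predictions on the training set, y the labels, H^∞ the limiting NTK Gram matrix (positive definite under the usual over-parameterization assumptions), and β > 0 a damping constant. The exact forms and constants used in the paper may differ.

```latex
% Hedged sketch of NTK-regime limiting ODEs for the prediction error
% (notation assumed, not taken verbatim from the paper):
% u(t) = network predictions on the training set, y = labels,
% H^\infty = limiting NTK Gram matrix (positive definite), \beta > 0 damping.
\begin{align}
  % Gradient flow (GD limit): first-order linear ODE
  \dot{u}(t) &= -H^{\infty}\bigl(u(t) - y\bigr), \\
  % Heavy Ball limit: constant damping
  \ddot{u}(t) + \beta\,\dot{u}(t) &= -H^{\infty}\bigl(u(t) - y\bigr), \\
  % Nesterov (NAG) limit: vanishing 3/t damping
  \ddot{u}(t) + \tfrac{3}{t}\,\dot{u}(t) &= -H^{\infty}\bigl(u(t) - y\bigr).
\end{align}
```

In this sketch, the positive definiteness of H^∞ makes the first two dynamics contract u(t) toward y at a linear (exponential) rate, whereas the vanishing 3/t damping in the NAG-style ODE is the standard explanation for a merely sublinear rate, which is consistent with the GD/HB versus NAG distinction stated in the abstract.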