Paper Title
A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks
Paper Authors
Paper Abstract
When equipped with efficient optimization algorithms, over-parameterized neural networks demonstrate a high level of performance even though the loss function is non-convex and non-smooth. While many works have focused on understanding the loss dynamics of neural networks trained with gradient descent (GD), in this work we consider a broad class of optimization algorithms that are commonly used in practice. For example, we show from a dynamical-systems perspective that the Heavy Ball (HB) method can converge to the global minimum of the mean squared error (MSE) at a linear rate (similar to GD), whereas Nesterov's accelerated gradient descent (NAG) may only converge to the global minimum sublinearly. Our results rely on the connection between the neural tangent kernel (NTK) and finite over-parameterized neural networks with ReLU activation, which leads to analyzing the limiting ordinary differential equations (ODEs) of the optimization algorithms. We show that optimizing the non-convex loss over the weights corresponds to optimizing a strongly convex loss over the prediction error. As a consequence, we can leverage classical convex optimization theory to understand the convergence behavior of neural networks. We believe our approach can also be extended to other optimization algorithms and network architectures.
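To make the mechanism in the abstract concrete, the following is a minimal sketch of the kind of limiting ODEs such an analysis typically studies, written in standard NTK-regime notation rather than the paper's own: u(t) denotes the vector of network predictions on the training set, y the labels, H^∞ the limiting NTK Gram matrix (positive definite under the usual over-parameterization assumptions), and β > 0 a damping constant. The exact forms and constants used in the paper may differ.

```latex
% Hedged sketch of NTK-regime limiting ODEs for the prediction error
% (notation assumed, not taken verbatim from the paper):
% u(t) = network predictions on the training set, y = labels,
% H^\infty = limiting NTK Gram matrix (positive definite), \beta > 0 damping.
\begin{align}
  % Gradient flow (GD limit): first-order linear ODE
  \dot{u}(t) &= -H^{\infty}\bigl(u(t) - y\bigr), \\
  % Heavy Ball limit: constant damping
  \ddot{u}(t) + \beta\,\dot{u}(t) &= -H^{\infty}\bigl(u(t) - y\bigr), \\
  % Nesterov (NAG) limit: vanishing 3/t damping
  \ddot{u}(t) + \tfrac{3}{t}\,\dot{u}(t) &= -H^{\infty}\bigl(u(t) - y\bigr).
\end{align}
```

In this sketch, the positive definiteness of H^∞ makes the first two dynamics contract u(t) toward y at a linear (exponential) rate, whereas the vanishing 3/t damping in the NAG-style ODE is the standard explanation for a merely sublinear rate, which is consistent with the GD/HB versus NAG distinction stated in the abstract.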