Paper Title
Convergence Analysis of Homotopy-SGD for non-convex optimization
Paper Authors
Paper Abstract
First-order stochastic methods for solving large-scale non-convex optimization problems are widely used in many big-data applications, e.g., training deep neural networks as well as other complex and potentially non-convex machine learning models. Their inexpensive iterations generally come together with slow global convergence rates (mostly sublinear), so a very large number of iterations is required before the iterates reach a neighborhood of a minimizer. In this work, we present a first-order stochastic algorithm based on a combination of homotopy methods and SGD, called Homotopy-Stochastic Gradient Descent (H-SGD), which is closely related to several heuristics proposed in the literature, e.g., optimization by Gaussian continuation, training by diffusion, and mollifying networks. Under some mild assumptions on the problem structure, we conduct a theoretical analysis of the proposed algorithm. Our analysis shows that, with a specifically designed scheme for the homotopy parameter, H-SGD enjoys a global linear rate of convergence to a neighborhood of a minimum while maintaining fast and inexpensive iterations. Experimental evaluations confirm the theoretical results and show that H-SGD can outperform standard SGD.
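The abstract only describes H-SGD at a high level. As a rough illustration, the minimal Python sketch below shows one common way to combine a homotopy/continuation scheme with SGD: run SGD on an interpolated objective H(x, t) = (1 - t) G(x) + t F(x) while moving the homotopy parameter t from 0 to 1. The toy objective f, the convex surrogate g, the linear schedule for t, and all function names are assumptions made for illustration and are not taken from the paper, whose specific scheme for the homotopy parameter may differ.

```python
import numpy as np

# Toy non-convex target: a convex quadratic plus high-frequency ripples
# (many local minima). Purely illustrative, not from the paper.
def f(x):
    return 0.5 * x**2 + 2.0 * np.sin(5.0 * x)

def grad_f(x):
    return x + 10.0 * np.cos(5.0 * x)

# "Easy" surrogate G: here just the convex quadratic part (an assumption).
def grad_g(x):
    return x

def h_sgd(x0, n_stages=20, inner_steps=200, lr=0.02, noise_std=0.5, seed=0):
    """Homotopy-SGD sketch: run SGD on H(x, t) = (1 - t) G(x) + t F(x),
    increasing t from 0 to 1 over the outer stages (assumed linear schedule)."""
    rng = np.random.default_rng(seed)
    x = x0
    for stage in range(n_stages):
        t = (stage + 1) / n_stages
        for _ in range(inner_steps):
            # Stochastic gradient of the homotopy objective; Gaussian noise
            # stands in for mini-batch sampling noise.
            g = (1.0 - t) * grad_g(x) + t * grad_f(x) + noise_std * rng.standard_normal()
            x -= lr * g
    return x

if __name__ == "__main__":
    x_h = h_sgd(x0=4.0)
    print(f"H-SGD iterate: x = {x_h:.3f}, f(x) = {f(x_h):.3f}")
```

In this generic form, early stages optimize a well-behaved surrogate and later stages gradually deform it into the target, which is the continuation idea the abstract refers to; the paper's analysis concerns a specific schedule for t rather than the fixed linear one used here.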