Paper Title

A High Probability Analysis of Adaptive SGD with Momentum

Authors

Xiaoyu Li, Francesco Orabona

Abstract

Stochastic Gradient Descent (SGD) and its variants are the most used algorithms in machine learning applications. In particular, SGD with adaptive learning rates and momentum is the industry standard to train deep networks. Despite the enormous success of these methods, our theoretical understanding of these variants in the nonconvex setting is not complete, with most of the results only proving convergence in expectation and with strong assumptions on the stochastic gradients. In this paper, we present a high probability analysis for adaptive and momentum algorithms, under weak assumptions on the function, stochastic gradients, and learning rates. We use it to prove for the first time the convergence of the gradients to zero in high probability in the smooth nonconvex setting for Delayed AdaGrad with momentum.
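
To make the algorithm named in the last sentence concrete, below is a minimal NumPy sketch of Delayed AdaGrad with momentum. "Delayed" means the step size at iteration t is built only from the squared gradients of iterations 1..t-1, so it is independent of the current stochastic gradient. The per-coordinate form, the exponential-moving-average momentum, and all constants (alpha, b0, mu) here are illustrative assumptions, not the exact update or parameters analyzed in the paper.

```python
import numpy as np

def delayed_adagrad_momentum(grad_fn, x0, alpha=0.1, b0=1.0, mu=0.9, steps=1000):
    """Sketch of Delayed AdaGrad with momentum (illustrative, not the
    paper's exact algorithm or constants).

    The "delayed" step size at iteration t uses only past squared
    gradients, so it is independent of the current stochastic gradient.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)        # momentum buffer
    s = np.zeros_like(x)        # running sum of past squared gradients
    for _ in range(steps):
        g = grad_fn(x)                       # stochastic gradient at x
        eta = alpha / np.sqrt(b0 + s)        # uses only *past* gradients
        m = mu * m + (1.0 - mu) * eta * g    # moving-average momentum
        x = x - m
        s = s + g * g                        # delay: update s after stepping
    return x

# Usage: noisy gradients of f(x) = 0.5 * ||x||^2, whose true gradient is x.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_final = delayed_adagrad_momentum(noisy_grad, np.ones(5), steps=2000)
print(np.linalg.norm(x_final))  # gradient norm should be small
```

Updating the running sum s only after the step is taken is what makes the learning rate independent of the current gradient, the property the paper's high probability analysis exploits.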
