Paper Title

Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization

Authors

Yangyang Xu, Yibo Xu

Abstract

Stochastic gradient methods (SGMs) have been extensively used for solving stochastic problems or large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases. Most of them require a large number of samples in some or all iterations of the improved SGMs. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm can achieve the optimal complexity result $O(\varepsilon^{-3})$ to produce a stochastic $\varepsilon$-stationary solution, if a mean-squared smoothness condition holds. Different from existing optimal methods, PStorm can achieve the ${O}(\varepsilon^{-3})$ result by using only one or $O(1)$ samples in every update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or $O(1)$ new observations. In addition, for large-scale machine learning problems, PStorm can generalize better by small-batch training than other optimal methods that require large-batch training and the vanilla SGM, as we demonstrate on training a sparse fully-connected neural network and a sparse convolutional neural network.
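To make the "momentum-based variance reduction" concrete, below is a minimal sketch (not the authors' implementation) of one PStorm-style update: a STORM-type recursive gradient estimator built from a single fresh sample per iteration, followed by a proximal step for an $\ell_1$ regularizer. The function names (`pstorm_sketch`, `sample`, `stoch_grad`), the toy sparse least-squares problem, and the constant step size `eta` and momentum weight `beta` are illustrative assumptions; the paper prescribes its own step-size and momentum schedules and the corresponding guarantees.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pstorm_sketch(sample, stoch_grad, x0, n_iters=2000, eta=0.05, beta=0.1, lam=0.01):
    """Sketch of a momentum-based variance-reduced proximal SGM.

    sample()          -> draws one data point xi (one sample per update)
    stoch_grad(x, xi) -> stochastic gradient of the smooth part F(.; xi) at x
    The nonsmooth part is assumed to be lam * ||x||_1, handled by the prox step.
    """
    x = np.array(x0, dtype=float)
    xi = sample()
    d = stoch_grad(x, xi)                   # one-sample initialization of the estimator
    for _ in range(n_iters):
        x_new = soft_threshold(x - eta * d, eta * lam)   # proximal step on lam * ||.||_1
        xi = sample()                                    # one fresh sample per update
        # STORM-style recursive estimator: the SAME sample xi is evaluated at x_new
        # and x, which is where the mean-squared smoothness condition enters.
        d = stoch_grad(x_new, xi) + (1.0 - beta) * (d - stoch_grad(x, xi))
        x = x_new
    return x

# Toy usage: sparse least squares  min_x E_(a,b)[(a^T x - b)^2 / 2] + lam * ||x||_1
rng = np.random.default_rng(0)
x_true = np.zeros(20)
x_true[:3] = 1.0

def sample():
    a = rng.normal(size=20)
    return a, a @ x_true + 0.01 * rng.normal()

def stoch_grad(x, xi):
    a, b = xi
    return (a @ x - b) * a

x_hat = pstorm_sketch(sample, stoch_grad, np.zeros(20))
```

The one-sample-per-update structure in this sketch mirrors the abstract's claim that PStorm attains the $O(\varepsilon^{-3})$ complexity using only one or $O(1)$ samples in every update.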
