Paper Title

The Geometry of Sign Gradient Descent

Paper Authors

Lukas Balles, Fabian Pedregosa, Nicolas Le Roux

Paper Abstract

Sign-based optimization methods have become popular in machine learning due to their favorable communication cost in distributed optimization and their surprisingly good performance in neural network training. Furthermore, they are closely connected to so-called adaptive gradient methods like Adam. Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the $\ell_\infty$-norm. In this work, we unify these existing results by showing a close connection between separable smoothness and $\ell_\infty$-smoothness and argue that the latter is the weaker and more natural assumption. We then proceed to study the smoothness constant with respect to the $\ell_\infty$-norm and thereby isolate geometric properties of the objective function which affect the performance of sign-based methods. In short, we find sign-based methods to be preferable over gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue. Both properties are common in deep networks.
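As background for the abstract (a standard formulation, not a result quoted from the paper): the sign-descent update $x_{t+1} = x_t - \eta\,\operatorname{sign}(\nabla f(x_t))$ is a steepest-descent step with respect to the $\ell_\infty$-norm, and $\ell_\infty$-smoothness asks the gradient to be Lipschitz from the $\ell_\infty$-norm to its dual $\ell_1$-norm, with constant $L_\infty$:

$$
\operatorname{sign}(\nabla f(x)) \in \operatorname*{arg\,max}_{\|d\|_\infty \le 1} \langle \nabla f(x), d \rangle,
\qquad
\|\nabla f(x) - \nabla f(y)\|_1 \le L_\infty \,\|x - y\|_\infty \quad \text{for all } x, y.
$$

Here $\eta$ is a step size and $L_\infty$ is the smoothness constant the abstract refers to; the first relation is written with $\in$ because the maximizer is not unique when some gradient coordinates are zero.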
