Paper Title
O(1) Communication for Distributed SGD through Two-Level Gradient Averaging
Paper Authors
Paper Abstract
Large neural network models present a hefty communication challenge to distributed Stochastic Gradient Descent (SGD), with a communication complexity of O(n) per worker for a model of n parameters. Many sparsification and quantization techniques have been proposed to compress the gradients, some reducing the communication complexity to O(k), where k << n. In this paper, we introduce a strategy called two-level gradient averaging (A2SGD) to consolidate all gradients down to merely two local averages per worker before two global averages are computed for the model update. A2SGD also retains local errors to maintain the variance needed for fast convergence. Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm. Our evaluation validates the theoretical conclusion and demonstrates that A2SGD significantly reduces the communication traffic per worker and improves the overall training time of LSTM-PTB by 3.2x and 23.2x, respectively, compared to Top-K and QSGD. To the best of our knowledge, A2SGD is the first to achieve O(1) communication complexity per worker for distributed SGD.
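The abstract only states that each worker collapses its gradient into two local averages and retains the residual error locally before two global averages are computed. The sketch below (plain NumPy, no distributed framework) simulates that communication pattern under stated assumptions: the sign-based grouping of gradient components and the per-worker mask-based reconstruction are illustrative guesses, not details given by the paper.

import numpy as np

rng = np.random.default_rng(0)
n_workers, n_params = 4, 1000
errors = [np.zeros(n_params) for _ in range(n_workers)]  # retained local errors

def two_level_step(grads, errors):
    """One communication round: each worker contributes only two scalars."""
    local_pairs, masks, new_errors = [], [], []
    for g, e in zip(grads, errors):
        corrected = g + e                       # error feedback: fold in retained error
        mask = corrected >= 0                   # hypothetical grouping: split by sign
        avg_pos = corrected[mask].mean() if mask.any() else 0.0
        avg_neg = corrected[~mask].mean() if (~mask).any() else 0.0
        decompressed = np.where(mask, avg_pos, avg_neg)
        local_pairs.append((avg_pos, avg_neg))  # the two local averages to transmit
        masks.append(mask)                      # kept locally, never communicated
        new_errors.append(corrected - decompressed)  # retain what compression lost
    # Level two: average the two scalars across workers (O(1) traffic per worker).
    g_pos = float(np.mean([p[0] for p in local_pairs]))
    g_neg = float(np.mean([p[1] for p in local_pairs]))
    return g_pos, g_neg, masks, new_errors

grads = [rng.normal(size=n_params) for _ in range(n_workers)]
g_pos, g_neg, masks, errors = two_level_step(grads, errors)
# Each worker rebuilds a dense update from the two global averages and its
# own locally stored mask, so no per-parameter data crosses the network.
updates = [np.where(m, g_pos, g_neg) for m in masks]
print(g_pos, g_neg, updates[0][:5])

The point of the sketch is the traffic pattern: regardless of n, each worker exchanges two scalars per round, which is how the per-worker communication complexity drops to O(1).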