Paper Title

Sparse Communication for Training Deep Networks

Paper Authors

Negar Foroutan Eghlidi, Martin Jaggi

Paper Abstract

Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models. In this algorithm, each worker shares its local gradients with the others and updates the parameters using the average gradient over all workers. Although distributed training reduces the computation time, the communication overhead associated with the gradient exchange forms a scalability bottleneck for the algorithm. Many compression techniques have been proposed to reduce the number of gradients that need to be communicated. However, compressing the gradients introduces yet another overhead to the problem. In this work, we study several compression schemes and identify how three key parameters affect the performance. We also provide a set of insights on how to increase performance and introduce a simple sparsification scheme, random-block sparsification, that reduces communication while keeping the performance close to standard SGD.
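To make the gradient-exchange step concrete, here is a minimal single-process NumPy sketch of synchronous SGD with a random-block sparsifier. It is an illustration under assumptions, not the paper's exact algorithm: the all-reduce is simulated by a plain average, all workers are assumed to pick the same randomly positioned contiguous block each step (coordinated through a shared seed), and the names `random_block_sparsify`, `synchronous_sgd_step`, and `block_frac` are hypothetical.

```python
# Illustrative sketch only: the block-selection rule and the averaging of
# just the selected block are assumptions, not the paper's exact method.
import numpy as np

def random_block_sparsify(grad, block_frac, rng):
    """Keep one randomly positioned contiguous block of the flattened
    gradient and zero out everything else."""
    flat = grad.ravel()
    block_len = max(1, int(block_frac * flat.size))
    start = rng.integers(0, flat.size - block_len + 1)
    sparse = np.zeros_like(flat)
    sparse[start:start + block_len] = flat[start:start + block_len]
    return sparse.reshape(grad.shape)

def synchronous_sgd_step(params, local_grads, lr, block_frac, seed):
    """Each 'worker' sparsifies its local gradient with the same random
    block (shared seed), the sparse gradients are averaged (simulating an
    all-reduce), and the shared parameters are updated."""
    sparse_grads = [
        random_block_sparsify(g, block_frac, np.random.default_rng(seed))
        for g in local_grads
    ]
    avg_grad = np.mean(sparse_grads, axis=0)
    return params - lr * avg_grad

# Toy usage: 4 workers, a 10-dimensional parameter vector.
rng = np.random.default_rng(0)
params = np.zeros(10)
local_grads = [rng.normal(size=10) for _ in range(4)]
params = synchronous_sgd_step(params, local_grads, lr=0.1,
                              block_frac=0.3, seed=42)
print(params)
```

A contiguous block is attractive from a communication standpoint because each worker only needs to send the block's start index plus the block values, rather than one index per retained coordinate as in top-k or random-k sparsification.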
