Paper Title
Optimal Gradient Compression for Distributed and Federated Learning
Paper Authors
Paper Abstract
Communicating information, like gradient vectors, between computing nodes in distributed and federated learning is typically an unavoidable burden, resulting in scalability issues. Indeed, communication might be slow and costly. Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques, in the form of sparsification, quantization, or low-rank approximation. Since compression is a lossy, or inexact, process, the iteration complexity is typically worsened; but the total communication complexity can improve significantly, possibly leading to large computation time savings. In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error. We perform both worst-case and average-case analysis, providing tight lower bounds. In the worst-case analysis, we introduce an efficient compression operator, Sparse Dithering, which is very close to the lower bound. In the average-case analysis, we design a simple compression operator, Spherical Compression, which naturally achieves the lower bound. Thus, our new compression schemes significantly outperform the state of the art. We conduct numerical experiments to illustrate this improvement.
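To make the bit-count versus compression-error trade-off concrete, here is a minimal NumPy sketch of the standard unbiased random-k sparsification compressor. This is only an illustrative example of a generic compression operator; it is not the paper's Sparse Dithering or Spherical Compression scheme, whose constructions are given in the paper itself.

```python
import numpy as np

def rand_k_sparsify(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased random-k sparsification: keep k coordinates chosen uniformly
    at random and rescale by d/k so that E[C(x)] = x."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

# Encoding the compressed vector takes roughly k * (32 + log2(d)) bits
# (k float32 values plus their indices) instead of 32 * d bits for the dense
# vector, while the compression variance E||C(x) - x||^2 = (d/k - 1) * ||x||^2
# grows as k decreases -- the trade-off the paper analyzes and optimizes.
rng = np.random.default_rng(0)
g = rng.standard_normal(1000)            # stand-in for a gradient vector
g_compressed = rand_k_sparsify(g, k=100, rng=rng)
```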