Paper Title
GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity
Paper Authors
Paper Abstract
We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing clients to perform multiple local gradient-type training steps before communication. In a recent breakthrough, Mishchenko et al. (2022) proved that local training, when properly executed, leads to provable communication acceleration, and this holds in the strongly convex regime without relying on any data similarity assumptions. However, their ProxSkip method requires all clients to take the same number of local training steps in each communication round. We propose a redesign of the ProxSkip method, allowing clients with "less important" data to get away with fewer local training steps without impacting the overall communication complexity of the method. In particular, we prove that our modified method, GradSkip, converges linearly under the same assumptions and has the same accelerated communication complexity, while the number of local gradient steps can be reduced relative to the local condition number. We further generalize our method by extending the randomness of probabilistic alternations to arbitrary unbiased compression operators and by considering a generic proximable regularizer. This generalization, which we call GradSkip+, recovers several related methods in the literature as special cases. Finally, we present an empirical study on carefully designed toy problems that confirms our theoretical claims.
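
To make the "local steps plus infrequent communication" template concrete, below is a minimal illustrative sketch (in Python) of a ProxSkip/Scaffnew-style consensus loop on toy quadratic local objectives: each client takes a control-variate-corrected local gradient step at every iteration, and communication (averaging) happens only with probability p. The quadratic data, stepsize choice, and variable names are our own illustrative assumptions, not taken from the paper; this is the baseline template the abstract refers to, not the GradSkip algorithm itself.

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 20                                   # number of clients, problem dimension
A = [np.diag(rng.uniform(1.0, 10.0, d)) for _ in range(n)]
b = [rng.standard_normal(d) for _ in range(n)]

def grad(i, x):                                # gradient of f_i(x) = 0.5 x^T A_i x - b_i^T x
    return A[i] @ x - b[i]

L = max(Ai.diagonal().max() for Ai in A)       # smoothness constant over all clients
gamma = 1.0 / L                                # local stepsize
p = 0.1                                        # probability of a communication round

x = [np.zeros(d) for _ in range(n)]            # local iterates
h = [np.zeros(d) for _ in range(n)]            # control variates (their sum stays zero)

for t in range(5000):
    # control-variate-corrected local gradient step on every client
    x_hat = [x[i] - gamma * (grad(i, x[i]) - h[i]) for i in range(n)]
    if rng.random() < p:                       # shared coin: communicate and average
        avg = sum(x_hat[i] - (gamma / p) * h[i] for i in range(n)) / n
        x_new = [avg.copy() for _ in range(n)]
    else:                                      # skip communication entirely
        x_new = x_hat
    h = [h[i] + (p / gamma) * (x_new[i] - x_hat[i]) for i in range(n)]
    x = x_new

x_star = np.linalg.solve(sum(A), sum(b))       # minimizer of sum_i f_i
print("distance to optimum:", np.linalg.norm(x[0] - x_star))

In this sketch every client computes a gradient at every iteration. GradSkip's modification, as described in the abstract, is to additionally let each client skip some of its local gradient computations, with better-conditioned ("less important") local problems allowed to skip more often, without changing the communication complexity; the precise per-client skipping rule is given in the paper and is not reproduced here.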