Paper Title
Straggler-Resilient Federated Learning: Leveraging the Interplay Between Statistical Accuracy and System Heterogeneity
Paper Authors
Paper Abstract
Federated Learning is a novel paradigm that involves learning from data samples distributed across a large network of clients while the data remains local. It is, however, known that federated learning is prone to multiple system challenges including system heterogeneity where clients have different computation and communication capabilities. Such heterogeneity in clients' computation speeds has a negative effect on the scalability of federated learning algorithms and causes significant slow-down in their runtime due to the existence of stragglers. In this paper, we propose a novel straggler-resilient federated learning method that incorporates statistical characteristics of the clients' data to adaptively select the clients in order to speed up the learning procedure. The key idea of our algorithm is to start the training procedure with faster nodes and gradually involve the slower nodes in the model training once the statistical accuracy of the data corresponding to the current participating nodes is reached. The proposed approach reduces the overall runtime required to achieve the statistical accuracy of data of all nodes, as the solution for each stage is close to the solution of the subsequent stage with more samples and can be used as a warm-start. Our theoretical results characterize the speedup gain in comparison to standard federated benchmarks for strongly convex objectives, and our numerical experiments also demonstrate significant speedups in wall-clock time of our straggler-resilient method compared to federated learning benchmarks.
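The key idea in the abstract (begin with the fastest nodes, add slower ones once the current stage reaches its statistical accuracy, and warm-start each new stage from the previous solution) can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the scalar least-squares objective, the doubling schedule for the participating set, the stopping threshold `c / total_samples` used as a stand-in for statistical accuracy, and the names `make_clients`, `local_gradient`, and `adaptive_training` are all hypothetical choices made for illustration.

```python
import random


def make_clients(num_clients, samples_per_client, true_w=3.0, seed=0):
    """Toy setup (assumption): each client holds noisy scalar samples around
    true_w and needs a fixed compute time per round. Returned fastest first."""
    rng = random.Random(seed)
    clients = []
    for _ in range(num_clients):
        data = [true_w + rng.gauss(0.0, 1.0) for _ in range(samples_per_client)]
        clients.append({"data": data, "time": rng.uniform(1.0, 10.0)})
    return sorted(clients, key=lambda cl: cl["time"])


def local_gradient(w, data):
    """Gradient of the strongly convex local loss 0.5 * (w - x)^2, averaged."""
    return sum(w - x for x in data) / len(data)


def adaptive_training(clients, n0=2, lr=0.5, c=1.0, max_rounds=1000):
    """Train with a growing set of participants, fastest clients first.

    Each stage stops once the squared gradient falls below c / total_samples
    (an assumed proxy for statistical accuracy), then the participating set
    doubles (assumed schedule) and training continues from the current model.
    """
    w = 0.0                 # model is carried across stages (warm start)
    wall_clock = 0.0
    n = min(n0, len(clients))
    stages = []
    while True:
        active = clients[:n]
        total_samples = sum(len(cl["data"]) for cl in active)
        tol = c / total_samples
        rounds = 0
        for _ in range(max_rounds):
            # Clients hold equally many samples here, so a plain average of
            # the local gradients equals the gradient over all active data.
            grad = sum(local_gradient(w, cl["data"]) for cl in active) / n
            if grad * grad <= tol:
                break
            w -= lr * grad
            # A synchronous round waits for the slowest active client.
            wall_clock += active[-1]["time"]
            rounds += 1
        stages.append((n, rounds, w))
        if n == len(clients):
            break
        n = min(2 * n, len(clients))
    return w, wall_clock, stages


if __name__ == "__main__":
    final_w, elapsed, stages = adaptive_training(make_clients(8, 50))
    for n, rounds, w in stages:
        print(f"clients={n} rounds={rounds} w={w:.4f}")
    print(f"simulated wall-clock time: {elapsed:.2f}")
```

In this toy model the early stages pay only the per-round time of the fastest clients, and the slowest clients are waited on only in the final stage, where the warm start leaves little work to do. A baseline that involves all clients from the first round would pay the slowest client's time in every round.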