Paper Title

On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention

Authors

Menglu Yu, Bo Ji, Hridesh Rajan, Jia Liu

Abstract

Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing demand for DL has also led to communication- and resource-intensive distributed training jobs for large-scale DL training, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" (RAR) technologies have recently emerged as a favorable computing architecture for efficiently handling network communication and computation load in GPU clusters. The most salient feature of RAR is that it removes the need for dedicated parameter servers, thus alleviating the potential communication bottleneck. However, when multiple RAR-based DL training jobs are deployed over GPU clusters, communication bottlenecks can still occur due to contention between DL training jobs. So far, there remains a lack of theoretical understanding of how to design contention-aware resource scheduling algorithms for RAR-based DL training jobs, which motivates us to fill this gap in this work. Our main contributions are three-fold: i) We develop a new analytical model that characterizes both the communication overhead related to the worker distribution of a job and the communication contention related to the co-location of different jobs; ii) Based on the proposed analytical model, we formulate the problem as a non-convex integer program to minimize the makespan of all RAR-based DL training jobs. To address the unique structure of this problem, which is not amenable to optimization algorithm design, we reformulate the problem as an integer linear program, which enables the design of an approximation algorithm with provable guarantees, called SJF-BCO (Smallest Job First with Balanced Contention and Overhead); and iii) We conduct extensive experiments to show the superiority of SJF-BCO over existing schedulers.
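As background for readers unfamiliar with RAR, the following is a minimal single-process sketch of the generic ring-all-reduce pattern (scatter-reduce followed by all-gather). It is not taken from the paper and does not model contention or the SJF-BCO scheduler. The function name `ring_all_reduce` and the simplification that each worker holds exactly one scalar per chunk are assumptions made for illustration only.

```python
def ring_all_reduce(worker_data):
    """Simulate a sum ring-all-reduce among n workers in one process.

    Assumption for illustration: worker_data[i] is worker i's local vector,
    already split into n chunks of one scalar each.
    """
    n = len(worker_data)
    chunks = [list(v) for v in worker_data]  # chunks[i][c]: worker i's copy of chunk c
    for v in chunks:
        assert len(v) == n, "this sketch expects exactly n chunks per worker"

    # Phase 1 (scatter-reduce): in each step, worker i sends one chunk to its
    # ring neighbor (i + 1) % n, which adds it to its own copy.
    for step in range(n - 1):
        # Collect all messages first so the sends behave as simultaneous.
        msgs = [((i - step) % n, chunks[i][(i - step) % n]) for i in range(n)]
        for i, (c, val) in enumerate(msgs):
            chunks[(i + 1) % n][c] += val
    # Now worker i holds the fully reduced chunk (i + 1) % n.

    # Phase 2 (all-gather): the reduced chunks circulate around the ring and
    # overwrite the stale partial copies.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, chunks[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (c, val) in enumerate(msgs):
            chunks[(i + 1) % n][c] = val
    return chunks


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(grads))
```

In this pattern every worker only exchanges data with its ring neighbor and sends 2(n - 1) chunks in total, so no central parameter server is involved; this is the property the abstract refers to.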
