Paper Title

GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs

Authors

Menglu Yu, Ye Tian, Bo Ji, Chuan Wu, Hridesh Rajan, Jia Liu

Abstract

Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL jobs. To resolve network communication bottlenecks and load balancing issues in distributed computing, the so-called "ring-all-reduce" decentralized architecture has been increasingly adopted to remove the need for dedicated parameter servers. To date, however, there remains a lack of theoretical understanding of how to design resource optimization algorithms for efficiently scheduling ring-all-reduce DDL jobs in computing clusters. This motivates us to fill this gap by proposing a series of new resource scheduling designs for ring-all-reduce DDL jobs. Our contributions in this paper are three-fold: i) we propose a new resource scheduling analytical model for ring-all-reduce deep learning, which covers a wide range of objectives in DDL performance optimization (e.g., excessive training avoidance, energy efficiency, fairness); ii) based on the proposed performance analytical model, we develop an efficient resource scheduling algorithm called GADGET (greedy ring-all-reduce distributed graph embedding technique), which enjoys a provable strong performance guarantee; iii) we conduct extensive trace-driven experiments to demonstrate the effectiveness of the GADGET approach and its superiority over the state of the art.
