Paper Title
dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training
Paper Authors
Paper Abstract
Distributed training using multiple devices (e.g., GPUs) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and apply effective performance optimizations when unexpectedly low training speed occurs. To date, no software tool exists that diagnoses performance issues and helps expedite distributed DNN training across different deep learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from the computation, communication, and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with < 5% error in most cases and finds optimization strategies that yield up to 3.48x speed-up over the baselines.
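
To make the replay idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of estimating iteration time by simulating a global data flow graph whose nodes carry profiled durations and device placements. The op names, durations, and the replay helper are illustrative assumptions, not dPRO's actual interface or algorithm.

    # Hypothetical sketch: graph-based trace replay, in the spirit of dPRO's replayer.
    # Each op has a device "queue" (e.g., a GPU compute stream or a network link) and a
    # measured duration; iteration time is estimated by simulating ops in dependency order.
    from collections import defaultdict

    def replay(ops, deps):
        """ops: {name: (device, duration_ms)} in topological order; deps: {name: [upstream names]}."""
        finish = {}                        # op -> simulated finish time
        device_free = defaultdict(float)   # device -> time the device becomes free
        for name, (device, dur) in ops.items():
            ready = max((finish[p] for p in deps.get(name, [])), default=0.0)
            start = max(ready, device_free[device])   # wait for both inputs and the device
            finish[name] = start + dur
            device_free[device] = finish[name]
        return max(finish.values())        # estimated per-iteration time

    # Toy example: one layer's forward/backward on a worker followed by its AllReduce and update.
    ops = {
        "fw_conv":        ("gpu0", 2.0),
        "bw_conv":        ("gpu0", 4.0),
        "allreduce_conv": ("net",  3.0),
        "update_conv":    ("gpu0", 0.5),
    }
    deps = {
        "bw_conv":        ["fw_conv"],
        "allreduce_conv": ["bw_conv"],
        "update_conv":    ["allreduce_conv"],
    }
    print(f"estimated iteration time: {replay(ops, deps):.1f} ms")

Under such a simulator, an optimizer could, for example, replay the graph with candidate changes (fused communication tensors, different operator placements) and keep the ones that shorten the estimated iteration time, which mirrors the "what-if" style of exploration described in the abstract.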