Paper Title

Distributed Training of Deep Learning Models: A Taxonomic Perspective

Paper Authors

Matthias Langer, Zhen He, Wenny Rahayu, Yanbo Xue

Paper Abstract

Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS are required to make many decisions to process their particular workloads efficiently in their chosen environment. The advent of GPU-based deep learning, the ever-increasing size of datasets and deep neural network models, and the bandwidth constraints that exist in cluster environments all require developers of DDLS to be innovative in order to train high-quality models quickly. Comparing DDLS side by side is difficult due to their extensive feature lists and architectural deviations. We aim to shine some light on the fundamental principles at work when training deep neural networks in a cluster of independent machines by analyzing the general properties associated with training deep learning models and how such workloads can be distributed in a cluster to achieve collaborative model training. We thereby provide an overview of the different techniques used by contemporary DDLS and discuss their influence on and implications for the training process. To conceptualize and compare DDLS, we group different techniques into categories, thus establishing a taxonomy of distributed deep learning systems.
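
To make the idea of distributing a training workload concrete, below is a minimal sketch (not taken from the paper) of one technique that falls under this taxonomy: synchronous data-parallel training, where each worker computes gradients on its own data shard and an averaged gradient drives a single replicated model. The linear-regression model, the shard layout, and all names in the code are illustrative assumptions; real DDLS apply the same pattern to deep neural networks across machines, with an all-reduce operation or a parameter server playing the averaging role.

```python
# Minimal sketch of synchronous data-parallel SGD (illustrative, not from
# the paper). Workers compute gradients on their own shards; a single
# synchronization step averages the gradients and updates the shared model.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: y = 3x + 1 plus noise, split across "workers" (shards).
X = rng.normal(size=(1024, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=1024)
num_workers = 4
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

w, b = 0.0, 0.0  # replicated model parameters (hypothetical toy model)
lr = 0.1

def local_gradients(w, b, X_shard, y_shard):
    """Gradient of half the mean squared error on one worker's shard."""
    pred = w * X_shard[:, 0] + b
    err = pred - y_shard
    return (err * X_shard[:, 0]).mean(), err.mean()

for step in range(100):
    # Each worker computes gradients on its shard (in parallel in a real system).
    grads = [local_gradients(w, b, Xs, ys) for Xs, ys in shards]
    # Synchronization point: average the gradients (the role of an all-reduce
    # or a parameter server), then every replica applies the same update.
    gw = sum(g[0] for g in grads) / num_workers
    gb = sum(g[1] for g in grads) / num_workers
    w -= lr * gw
    b -= lr * gb

print(f"learned w={w:.2f}, b={b:.2f}  (target: w=3.00, b=1.00)")
```

Because every worker applies the identical averaged update, the replicas stay in lockstep; this is the synchronous end of the design space, and much of the taxonomy concerns relaxing exactly this synchronization constraint.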
