Paper Title

Benchmarking Resource Usage for Efficient Distributed Deep Learning

Authors

Nathan C. Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones, David Bestor, Devesh Tiwari, Vijay Gadepally, Siddharth Samsi

Abstract

Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototyping consume immense resources that can prevent resource-constrained researchers from experimenting with large models and carry considerable environmental impact. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources -- especially specialized computationally-intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks -- natural language processing, computer vision, and chemistry -- on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy-saving mechanisms such as power utilization and GPU clock rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource and energy-constrained regimes. We fit power law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows with minimal impact on training.
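
The power-law relationship the abstract describes, training time as a function of available compute, can be illustrated with a minimal sketch. This is not the authors' code or data: the GPU counts, timings, and the `power_law` helper below are hypothetical, and SciPy's `curve_fit` is used here as one possible fitting routine.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_gpus, a, b):
    """Power law of the form time = a * n_gpus**b."""
    return a * np.power(n_gpus, b)

# Hypothetical measurements (illustrative only, not the paper's data):
# wall-clock training time in hours for a fixed workload at each GPU count.
gpus = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
hours = np.array([96.0, 50.1, 26.3, 14.0, 7.9, 4.6, 2.9])

# Fit the power law; p0 seeds the optimizer near ideal linear scaling (b = -1).
(a, b), _ = curve_fit(power_law, gpus, hours, p0=(100.0, -1.0))
print(f"training_time ~ {a:.1f} * n_gpus^{b:.2f}")
```

A fitted exponent b near -1 would indicate close-to-linear speedup from additional GPUs, while values closer to 0 indicate diminishing returns, which is the kind of trade-off the paper's scaling experiments aim to quantify.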
