Paper Title

Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers

Authors

Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung Kim

Abstract

Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training massive DNN models can often exceed the aggregate capacity of all available GPUs on a single server; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training massive models efficiently on a single commodity server. Across various massive DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.
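
To make the baseline that the abstract criticizes concrete, the sketch below illustrates generic GPU-to-CPU tensor swapping ("virtualized GPU memory") in PyTorch. It is only a minimal illustration under assumed names (`swap_out`, `swap_in`) and hypothetical tensor sizes; it is not Harmony's scheduler or implementation.

```python
import torch

# Minimal sketch of the swapping baseline described in the abstract:
# tensors are copied to CPU memory when GPU memory is scarce and copied
# back right before they are needed again. Each copy crosses PCIe, which
# is the "swapping overhead" the paper targets.

def swap_out(t: torch.Tensor) -> torch.Tensor:
    """Copy a GPU tensor into (pinned) CPU memory so its GPU storage can be freed."""
    pin = torch.cuda.is_available()
    host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=pin)
    host.copy_(t, non_blocking=True)  # GPU -> CPU transfer over PCIe
    return host

def swap_in(t: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Copy a CPU tensor back onto the GPU just before it is used again."""
    return t.to(device, non_blocking=True)  # CPU -> GPU transfer over PCIe

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    activation = torch.randn(4096, 4096, device=device)  # hypothetical activation
    saved = swap_out(activation)   # swap out to make room for other tensors
    del activation                 # GPU memory can now be reused
    restored = swap_in(saved, device)
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for the asynchronous copies to finish
    print(restored.shape)
```

If every layer's activations and weights are moved this way on every iteration, the transfer volume quickly dominates training time; Harmony's reported gains come from rescheduling computation and data movement so that far less of this swap traffic is needed.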
