Paper Title

Reducing Activation Recomputation in Large Transformer Models

Paper Authors

Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro

Paper Abstract

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.
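
The abstract contrasts full activation recomputation, which replays the whole forward pass to save memory, with selective activation recomputation, which recomputes only the parts of attention that hold large activations but cost few FLOPs. Below is a minimal, illustrative PyTorch sketch of that idea using torch.utils.checkpoint; the class name, shapes, and demo are assumptions for illustration, not the authors' Megatron-LM implementation.

```python
# Minimal sketch of selective activation recomputation, assuming plain PyTorch.
# Only the core attention (scores, softmax, attention-over-V), which produces
# large activations but costs few FLOPs, is checkpointed; the rest of the layer
# stores its activations as usual. Illustrative only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class SelectiveRecomputeAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.proj = nn.Linear(hidden_size, hidden_size)

    def _core_attention(self, q, k, v):
        # Attention scores and softmax outputs grow with (sequence length)^2,
        # so not storing them removes most of the layer's activation memory,
        # while recomputing them is cheap relative to the qkv/proj matmuls.
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.head_dim ** 0.5
        probs = torch.softmax(scores, dim=-1)
        return torch.matmul(probs, v)

    def forward(self, x):
        b, s, h = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.num_heads, self.head_dim)
        q, k, v = (t.reshape(shape).transpose(1, 2) for t in (q, k, v))
        # checkpoint() saves only the inputs of _core_attention and replays it
        # during the backward pass instead of storing its intermediates.
        context = checkpoint(self._core_attention, q, k, v, use_reentrant=False)
        context = context.transpose(1, 2).reshape(b, s, h)
        return self.proj(context)


if __name__ == "__main__":
    layer = SelectiveRecomputeAttention(hidden_size=64, num_heads=4)
    out = layer(torch.randn(2, 16, 64, requires_grad=True))
    out.sum().backward()  # gradients flow through the recomputed attention core
```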
