Paper Title

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Paper Authors

Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He

Paper Abstract

The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3X over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion parameter scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can inference 25x larger models than with GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over $50\%$ of A6000 peak).
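As a rough illustration of the multi-GPU, kernel-injected inference path the abstract describes, the sketch below wraps a Hugging Face model with DeepSpeed Inference. The argument names (`mp_size`, `dtype`, `replace_with_kernel_inject`) reflect the public DeepSpeed API around the time of the paper and are assumptions for illustration, not code taken from the paper itself; the tiny `gpt2` model is a placeholder for the much larger dense models the paper targets.

```python
# Hedged sketch: serving a dense transformer with DeepSpeed Inference.
# API names below are assumptions based on the public DeepSpeed interface
# of this era (deepspeed.init_inference), not the paper's own code.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates far larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build the inference engine: shards the model across `mp_size` GPUs
# (tensor-model parallelism) and injects DeepSpeed's fused transformer
# kernels in place of the stock PyTorch layers.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # number of GPUs for tensor parallelism
    dtype=torch.half,                # FP16 for lower latency / higher throughput
    replace_with_kernel_inject=True, # use optimized inference kernels
)

# The wrapped module behaves like the original Hugging Face model.
inputs = tokenizer("DeepSpeed Inference enables", return_tensors="pt")
inputs = {k: v.to(engine.module.device) for k, v in inputs.items()}
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For models that exceed aggregate GPU memory, the same engine is configured to offload weights to CPU or NVMe (the heterogeneous ZeRO-Inference path mentioned above); the exact offload configuration keys vary by DeepSpeed version, so consult the library documentation rather than treating this sketch as definitive.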
