FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems

Paper Title

FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems

Paper Authors

Rui Ma, Evangelos Georganas, Alexander Heinecke, Andrew Boutros, Eriko Nurvitadhi

Paper Abstract

Rapid advances in artificial intelligence (AI) technology have led to significant accuracy improvements in a myriad of application domains at the cost of larger and more compute-intensive models. Training such models on massive amounts of data typically requires scaling to many compute nodes and relies heavily on collective communication algorithms, such as all-reduce, to exchange the weight gradients between different nodes. The overhead of these collective communication operations in a distributed AI training system can bottleneck its performance, with more pronounced effects as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead by profiling distributed AI training. Then, we propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and optimize network bandwidth utilization via data compression. The AI smart NIC frees up the system's compute resources to perform the more compute-intensive tensor operations and increases the overall node-to-node communication efficiency. We perform real measurements on a prototype distributed AI training system consisting of 6 compute nodes to evaluate the performance gains of our proposed FPGA-based AI smart NIC compared to a baseline system with regular NICs. We also use these measurements to validate an analytical model that we formulate to predict performance when scaling to larger systems. Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
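For readers unfamiliar with the all-reduce collective mentioned in the abstract, the following is a minimal software sketch of one common variant, ring all-reduce, simulated in plain Python. The abstract does not say which all-reduce algorithm the paper uses, so the ring scheme is an assumption made purely for illustration, and the function name `ring_all_reduce` is hypothetical. It does not model the FPGA offload or the data compression described above.

```python
def ring_all_reduce(node_grads):
    """Simulate a ring all-reduce (element-wise sum) across nodes.

    node_grads: one equal-length list of numbers per node.
    Returns the buffer held by each node after the operation.
    Assumption for simplicity: the length is divisible by the node count.
    """
    n = len(node_grads)
    length = len(node_grads[0])
    assert length % n == 0, "gradient length must be divisible by node count"
    chunk = length // n
    bufs = [list(g) for g in node_grads]  # work on copies

    def span(c):
        # Slice covering chunk index c
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, bufs[i][span((i - s) % n)]) for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], data)]

    # Node i now holds the fully reduced chunk (i + 1) mod n.
    # Phase 2: all-gather. In step s, node i forwards chunk (i + 1 - s) mod n
    # to its right neighbour, which overwrites its copy.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, bufs[i][span((i + 1 - s) % n)]) for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][span(c)] = list(data)

    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(grads))
```

In this scheme each node sends 2(N-1) chunks of size 1/N of the gradient, so per-node traffic stays close to twice the gradient size while the number of sequential communication steps grows with N.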

Paper Title