Paper Title

Exploring the limits of Concurrency in ML Training on Google TPUs

Paper Authors

Sameer Kumar, James Bradbury, Cliff Young, Yu Emma Wang, Anselm Levskaya, Blake Hechtman, Dehao Chen, HyoukJoong Lee, Mehmet Deveci, Naveen Kumar, Pankaj Kanwar, Shibo Wang, Skye Wanderman-Milne, Steve Lacy, Tao Wang, Tayo Oguntebi, Yazhou Zu, Yuanzhong Xu, Andy Swing

Paper Abstract

Recent results in language understanding using neural networks have required training hardware of unprecedented scale, with thousands of chips cooperating on a single training run. This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism to overcome scaling limitations from the fixed batch size in data parallelism, communication/collective optimizations, distributed evaluation of training metrics, and host input processing scaling optimizations. These techniques are demonstrated in both the TensorFlow and JAX programming frameworks. We also present performance results from the recent Google submission to the MLPerf-v0.7 benchmark contest, achieving record training times from 16 to 28 seconds in four MLPerf models on the Google TPU-v3 Multipod machine.
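The abstract names data parallelism with cross-replica gradient reduction as the baseline scaling approach whose fixed batch size motivates model parallelism. The snippet below is a minimal JAX sketch of that data-parallel pattern, not code from the paper: the model, parameter names, and learning rate are hypothetical, and the all-reduce is expressed with jax.lax.pmean inside jax.pmap, the standard JAX collective for this setup.

```python
# Minimal data-parallel training-step sketch in JAX (illustrative only).
# Each device holds a full model replica, computes gradients on its local
# batch shard, and gradients are averaged with an all-reduce collective
# before the parameter update.
from functools import partial

import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Hypothetical linear model with squared-error loss, for illustration.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    # Cross-replica all-reduce of gradients: the kind of collective whose
    # performance the paper's communication optimizations target.
    grads = jax.lax.pmean(grads, axis_name="devices")
    # Plain SGD update with an illustrative learning rate of 0.1.
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)


n_devices = jax.local_device_count()
params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
# Replicate parameters onto every local device; shard the batch across them.
params = jax.device_put_replicated(params, jax.local_devices())
x = jnp.ones((n_devices, 16, 8))  # per-device shard of 16 examples
y = jnp.ones((n_devices, 16, 1))
params = train_step(params, x, y)
```

Because the global batch is the per-device shard times the number of replicas, scaling to thousands of chips this way inflates the batch size, which is the limitation the paper's model-parallelism techniques are meant to overcome.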
