Paper Title
PowerNorm: Rethinking Batch Normalization in Transformers
Paper Authors
Paper Abstract
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different from batch normalization (BN), which is widely adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN performs poorly, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. We make our code publicly available at \url{https://github.com/sIncerass/powernorm}.
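To make the three ingredients of the scheme described above more concrete, below is a minimal PyTorch-style sketch, not the authors' reference implementation (which is in the linked repository). The module name `PowerNormSketch`, the tensor shapes, and the momentum value are illustrative assumptions, and the paper's approximate backpropagation is only crudely mimicked by keeping the running statistic out of the autograd graph.

```python
import torch
import torch.nn as nn


class PowerNormSketch(nn.Module):
    """Illustrative sketch of the normalization scheme from the abstract:
    no zero-mean subtraction, and scaling by a running quadratic mean rather
    than per-batch statistics. The approximate backpropagation of the paper
    is only approximated here by treating the running statistic as a constant;
    see the authors' repository for the actual method."""

    def __init__(self, num_features, eps=1e-5, momentum=0.05):
        super().__init__()
        self.eps = eps
        self.momentum = momentum  # update rate for the running statistic (assumed value)
        self.weight = nn.Parameter(torch.ones(num_features))  # learnable scale
        self.bias = nn.Parameter(torch.zeros(num_features))   # learnable shift
        # running quadratic mean (second moment), one value per feature
        self.register_buffer("running_quad_mean", torch.ones(num_features))

    def forward(self, x):
        # x: (batch, seq_len, num_features), as in a Transformer layer
        if self.training:
            # per-batch quadratic mean over the batch and sequence dimensions
            quad_mean = x.pow(2).mean(dim=(0, 1))
            # update the running statistic outside the autograd graph
            with torch.no_grad():
                self.running_quad_mean.lerp_(quad_mean, self.momentum)
            # normalize with the running statistic; keeping it detached stands
            # in (very roughly) for the paper's approximate backpropagation
            denom = torch.sqrt(self.running_quad_mean.detach() + self.eps)
        else:
            denom = torch.sqrt(self.running_quad_mean + self.eps)
        # zero-mean normalization is relaxed, so there is no mean subtraction
        return self.weight * (x / denom) + self.bias
```

In a Transformer block, such a module would be used in place of `nn.LayerNorm`, e.g. `norm = PowerNormSketch(d_model)` applied to activations of shape `(batch, seq_len, d_model)`; the key design point is that the forward pass depends on a smoothed statistic rather than on the statistics of the current batch alone.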