Paper Title

A Fast Post-Training Pruning Framework for Transformers

Paper Authors

Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

Paper Abstract

Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We apply our method to BERT-base and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference latency, while maintaining < 1% loss in accuracy. Importantly, our framework prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain the models.
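The first of the three techniques scores each attention head (and FC filter) by its Fisher information, i.e., the expected squared gradient of the loss with respect to that component's mask variable on the sample dataset. As a rough illustration of the head-scoring step only, the sketch below computes diagonal empirical Fisher scores for attention heads; it assumes a HuggingFace-style PyTorch model that accepts a `head_mask` argument and returns a loss when labels are provided, and the function name `head_fisher_scores` is an illustrative choice, not part of the paper's released code.

```python
# Minimal sketch (assumption: a HuggingFace-style model such as
# BertForSequenceClassification, which accepts a `head_mask` of shape
# [num_layers, num_heads] and returns `outputs.loss` when labels are given).
import torch

def head_fisher_scores(model, dataloader, num_layers, num_heads, device="cuda"):
    model.to(device).eval()
    # Freeze parameters; we only need gradients w.r.t. the mask variables.
    for p in model.parameters():
        p.requires_grad_(False)

    # All-ones mask: its gradient measures each head's effect on the loss.
    head_mask = torch.ones(num_layers, num_heads, device=device, requires_grad=True)
    fisher = torch.zeros(num_layers, num_heads, device=device)

    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch, head_mask=head_mask)
        outputs.loss.backward()
        # Accumulate squared gradients: the diagonal empirical Fisher.
        fisher += head_mask.grad.detach() ** 2
        head_mask.grad = None

    return fisher  # higher score = head is more important to keep
```

In the paper, scores of this kind feed the lightweight mask search under the given FLOPs/latency constraint, after which mask rearrangement and mask tuning refine the result; this sketch covers only the scoring step.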
