Paper Title

Gradient-based Intra-attention Pruning on Pre-trained Language Models

Paper Authors

Ziqing Yang, Yiming Cui, Xin Yao, Shijin Wang

Paper Abstract

Pre-trained language models achieve superior performance but are computationally expensive. Techniques such as pruning and knowledge distillation have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method GRAIN (Gradient-based Intra-attention pruning), which performs task-specific pruning with knowledge distillation and yields highly effective models. Different from common approaches that prune each attention head as a whole, GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models. We also propose a gradient separation strategy that reduces the interference of distillation on pruning for a better combination of the two approaches. Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime, and achieves $6\sim7\times$ speedups while maintaining $93\%\sim99\%$ performance. Under extreme compression where only $3\%$ transformer weights remain, the pruned model is still competitive compared to larger models.
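To make the idea in the abstract concrete, below is a minimal PyTorch sketch of gradient-based intra-attention importance scoring on a Hugging Face BERT backbone. It is an illustration, not GRAIN's implementation: the Taylor-style score |w · ∂L/∂w| aggregated per output dimension, the stand-in loss, and the 50% pruning ratio are assumptions, and the gradient separation strategy from the paper is only noted in a comment.

```python
# Hypothetical sketch of gradient-based intra-attention importance scoring.
# NOT the authors' implementation: the per-dimension Taylor score, the toy
# objective, and the 50% global pruning ratio are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model.train()

# Forward/backward on a tiny batch to populate weight gradients.
inputs = tokenizer(["a short example sentence for pruning"], return_tensors="pt")
outputs = model(**inputs)
# Stand-in objective; in practice this would be the task loss, with the
# distillation gradients kept separate from the importance estimate
# (the gradient separation idea described in the abstract).
loss = outputs.last_hidden_state.pow(2).mean()
loss.backward()

# Score each intra-attention output dimension of every Q/K/V projection,
# rather than treating a whole attention head as one prunable unit.
scores = {}
with torch.no_grad():
    for layer_idx, layer in enumerate(model.encoder.layer):
        attn = layer.attention.self
        for name in ("query", "key", "value"):
            linear = getattr(attn, name)
            # First-order Taylor importance per output row (one hidden dim
            # each): |w * grad| summed over the input dimension.
            score = (linear.weight * linear.weight.grad).abs().sum(dim=1)
            scores[(layer_idx, name)] = score

# Rank all intra-attention dimensions globally and mark the lowest 50% for
# removal; a real pruner would rebuild smaller Linear layers from the
# surviving rows (and the matching columns of the output projection).
all_scores = torch.cat(list(scores.values()))
threshold = all_scores.kthvalue(int(0.5 * all_scores.numel())).values
masks = {key: (s > threshold) for key, s in scores.items()}
print({k: int(m.sum()) for k, m in list(masks.items())[:3]})  # kept dims per block
```

Scoring each Q/K/V output dimension separately, instead of whole heads, is what expands the structure search space mentioned in the abstract; an actual pipeline would then shrink the projection matrices to the surviving dimensions and continue training with knowledge distillation.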
