Paper Title
On the Effect of Dropping Layers of Pre-trained Transformer Models
Paper Authors
Paper Abstract
Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by recent work on pruning and distilling pre-trained models, we explore strategies to drop layers from pre-trained models and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa and XLNet models by up to 40% while maintaining up to 98% of their original performance. Additionally, we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations, such as (i) the lower layers are the most critical for maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers, and (iii) models trained using different objective functions exhibit different learning patterns with respect to layer dropping.
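To make the layer-dropping idea concrete, below is a minimal sketch (not the authors' released code) of one of the simplest strategies discussed, top-layer dropping, using the Hugging Face Transformers library. The model name "bert-base-uncased", the helper name `drop_top_layers`, and the choice of keeping 8 layers are illustrative assumptions; the pruned model would then be fine-tuned on a GLUE task as usual.

```python
# Minimal sketch of top-layer dropping on a pre-trained BERT model.
# Assumptions: Hugging Face Transformers, "bert-base-uncased" (12 encoder
# layers), and a hypothetical helper name drop_top_layers.
import torch
from transformers import BertModel


def drop_top_layers(model: BertModel, keep_layers: int) -> BertModel:
    """Keep only the lowest `keep_layers` encoder layers of a BERT model."""
    model.encoder.layer = torch.nn.ModuleList(model.encoder.layer[:keep_layers])
    model.config.num_hidden_layers = keep_layers
    return model


model = BertModel.from_pretrained("bert-base-uncased")  # 12 encoder layers
pruned = drop_top_layers(model, keep_layers=8)          # drop the top 4 layers (~33%)
print(len(pruned.encoder.layer))                        # -> 8
```

Keeping the lower layers and discarding the upper ones reflects observation (i) in the abstract: the lower layers are the most critical for downstream task performance.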