Paper Title
QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
Paper Authors
Paper Abstract
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. Knowledge distillation addresses computational efficiency by self-distilling BERT into a smaller transformer representation with fewer layers and a smaller internal embedding. However, the performance of these models drops as the number of layers is reduced, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique on top of TinyBERT, achieving an x3 speedup over BERT-base with minimal accuracy loss. In this work, we extend the Dynamic-TinyBERT approach to generate a much more efficient model. We use MiniLM distillation jointly with the LAT method, and we further enhance efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches at any computational budget on the SQuAD1.1 dataset (up to an x8.8 speedup with <1% accuracy loss). The code to reproduce this work is publicly available on GitHub.
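The abstract combines three ingredients: MiniLM distillation, the LAT technique, and low-bit quantization. As a rough sketch of the quantization step only (not the authors' released pipeline), the snippet below applies PyTorch post-training dynamic INT8 quantization to a generic MiniLM checkpoint. The checkpoint name, the question/context strings, and the truncation length are illustrative assumptions; the actual QuaLA-MiniLM model is trained, fine-tuned on SQuAD1.1, and made length-adaptive as described in the paper.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Hypothetical stand-in checkpoint: a generic MiniLM without a trained QA head.
# The paper's actual length-adaptive, SQuAD1.1-fine-tuned model is distributed
# via the authors' GitHub repository.
model_name = "microsoft/MiniLM-L12-H384-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

# Post-training dynamic INT8 quantization of all Linear layers:
# weights are stored in int8, activations are quantized on the fly.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Run a forward pass with a reduced max_length to mimic the
# sequence-length side of the accuracy-efficiency trade-off.
# With an untrained QA head the predicted span is not meaningful;
# this only shows that the quantized model runs end to end.
question = "What does QuaLA-MiniLM combine?"
context = ("QuaLA-MiniLM combines MiniLM distillation, the Length Adaptive "
           "Transformer technique, and low-bit quantization.")
inputs = tokenizer(question, context, return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    outputs = quantized_model(**inputs)
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```

Dynamic quantization is shown here because it needs no calibration data; the paper's low-bit quantization of the length-adaptive model may use a different recipe.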