Paper Title

RobBERTje: a Distilled Dutch BERT Model

Paper Authors

Pieter Delobelle, Thomas Winters, Bettina Berendt

Paper Abstract

Pre-trained large-scale language models such as BERT have gained a lot of attention thanks to their outstanding performance on a wide range of natural language tasks. However, due to their large number of parameters, they are resource-intensive both to deploy and to fine-tune. Researchers have created several methods for distilling language models into smaller ones to increase efficiency, with a small performance trade-off. In this paper, we create several different distilled versions of the state-of-the-art Dutch RobBERT model and call them RobBERTje. The distilled versions differ in their distillation corpus, namely whether the sentences are shuffled and whether they are merged with subsequent sentences. We found that the performance of the models using the shuffled versus non-shuffled datasets is similar for most tasks and that randomly merging subsequent sentences in a corpus creates models that train faster and perform better on tasks with long sequences. Upon comparing distillation architectures, we found that the larger DistilBERT architecture worked significantly better than the Bort hyperparametrization. Interestingly, we also found that the distilled models exhibit less gender-stereotypical bias than their teacher model. Since smaller architectures decrease the time to fine-tune, these models allow for more efficient training and more lightweight deployment for many Dutch downstream language tasks.
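
As a rough illustration of the distillation technique named in the title: DistilBERT-style knowledge distillation trains the smaller student to match the teacher's temperature-softened output distribution in addition to the usual hard labels. The sketch below is a minimal, assumption-laden PyTorch version of such a combined objective; the function name, temperature, and mixing weight alpha are illustrative choices, not the exact configuration used to train RobBERTje.

```python
# Minimal sketch of a DistilBERT-style distillation objective.
# All names and hyperparameters here are illustrative assumptions,
# not the authors' exact training setup.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine a soft KL term against the teacher with a hard cross-entropy term.

    student_logits, teacher_logits: (batch, vocab) logits for the masked positions.
    labels: (batch,) gold token ids for those positions (-100 = ignore).
    """
    # Soft targets: push the student toward the teacher's smoothed distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard masked-language-modelling cross-entropy.
    hard_loss = F.cross_entropy(student_logits, labels, ignore_index=-100)

    # Weighted mixture of the two terms.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```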
