Paper Title

On the importance of pre-training data volume for compact language models

Paper Authors

Vincent Micheli, Martin d'Hoffschmidt, François Fleuret

Paper Abstract

Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
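The abstract describes a two-stage pipeline: pre-training compact BERT-based models on limited volumes of French text, then fine-tuning on the FQuAD question-answering dataset. The sketch below illustrates that kind of pipeline with Hugging Face transformers; it is not the authors' code, and the model dimensions, hyperparameters, the corpus path `french_corpus_100mb.txt`, and the reuse of the multilingual BERT vocabulary are all illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertForQuestionAnswering,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Stage 1: pre-train a compact BERT with masked language modeling on a
# limited slice of French text (the corpus path is a placeholder).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")  # assumed vocabulary
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,          # illustrative "compact" dimensions,
    num_hidden_layers=4,      # not the sizes used in the paper
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)

raw = load_dataset("text", data_files={"train": "french_corpus_100mb.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="compact-bert-fr",
        per_device_train_batch_size=32,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("compact-bert-fr")
tokenizer.save_pretrained("compact-bert-fr")

# Stage 2: load the pre-trained encoder with a span-prediction head for
# fine-tuning on an extractive QA dataset such as FQuAD (the QA head is
# randomly initialized; the standard SQuAD-style preprocessing of question,
# context, and answer spans is omitted here for brevity).
qa_model = BertForQuestionAnswering.from_pretrained("compact-bert-fr")
```

In this sketch, varying the amount of pre-training text, as the study does, would correspond to swapping the file passed via `data_files`.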
