Paper Title

On the importance of pre-training data volume for compact language models

Paper Authors

Vincent Micheli, Martin d'Hoffschmidt, François Fleuret

Paper Abstract

Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
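The abstract describes a two-stage pipeline: pre-training compact BERT-based models on limited volumes of French text, then fine-tuning on the FQuAD question-answering dataset. The sketch below illustrates that kind of pipeline with Hugging Face transformers; it is not the authors' code, and the model dimensions, hyperparameters, the corpus path `french_corpus_100mb.txt`, and the reuse of the multilingual BERT vocabulary are all illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertForQuestionAnswering,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Stage 1: pre-train a compact BERT with masked language modeling on a
# limited slice of French text (the corpus path is a placeholder).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")  # assumed vocabulary
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,          # illustrative "compact" dimensions,
    num_hidden_layers=4,      # not the sizes used in the paper
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)

raw = load_dataset("text", data_files={"train": "french_corpus_100mb.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="compact-bert-fr",
        per_device_train_batch_size=32,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("compact-bert-fr")
tokenizer.save_pretrained("compact-bert-fr")

# Stage 2: load the pre-trained encoder with a span-prediction head for
# fine-tuning on an extractive QA dataset such as FQuAD (the QA head is
# randomly initialized; the standard SQuAD-style preprocessing of question,
# context, and answer spans is omitted here for brevity).
qa_model = BertForQuestionAnswering.from_pretrained("compact-bert-fr")
```

In this sketch, varying the amount of pre-training text, as the study does, would correspond to swapping the file passed via `data_files`.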
