Paper Title
The Effects of In-domain Corpus Size on pre-training BERT
Paper Authors
Paper Abstract
Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulties associated with collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conducted a series of experiments by pre-training Bidirectional Encoder Representations from Transformers (BERT) with different sizes of biomedical corpora. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with limited training steps can lead to better performance on downstream domain-specific NLP tasks compared with fine-tuning models pre-trained on general corpora.
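The abstract describes pre-training BERT with a masked language modeling objective on an in-domain biomedical corpus before fine-tuning on downstream tasks. The sketch below is a minimal illustration of that kind of domain pre-training using the Hugging Face Transformers library; it is not the authors' code. The corpus file name (biomedical_corpus.txt), the choice to continue from the bert-base-uncased checkpoint rather than train from scratch, and all hyperparameters are illustrative assumptions, not settings reported in the paper.

```python
# Minimal sketch of in-domain masked language modeling (MLM) pre-training.
# Assumptions (not from the paper): continued pre-training from bert-base-uncased,
# a single plain-text biomedical corpus file, and placeholder hyperparameters.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical in-domain corpus: one plain-text file, one sentence/paragraph per line.
raw = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# 15% random token masking, as in the original BERT pre-training objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-biomedical-mlm",
    per_device_train_batch_size=32,
    max_steps=100_000,      # "limited training steps"; illustrative value only
    learning_rate=1e-4,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The resulting checkpoint would then be fine-tuned on domain-specific downstream tasks (e.g., biomedical NER or relation extraction) and compared against models pre-trained only on general corpora, following the comparison described in the abstract.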