Title

Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA

Authors

Poerner, Nina, Waltinger, Ulli, Schütze, Hinrich

Abstract

Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO_2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed BioBERT model. We cover over 60% of the BioBERT-BERT F1 delta, at 5% of BioBERT's CO_2 footprint and 2% of its cloud compute cost. We also show how to quickly adapt an existing general-domain Question Answering (QA) model to an emerging domain: the Covid-19 pandemic.
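The abstract's core step is aligning target-domain Word2Vec vectors with the wordpiece vectors of a general-domain PTLM over a shared vocabulary. The paper's exact alignment procedure is not reproduced here; a common technique for this kind of embedding-space alignment is the orthogonal Procrustes solution, sketched below as an illustration (function name and toy data are my own, not from the paper):

```python
import numpy as np

def align_embeddings(X, Y):
    """Learn an orthogonal map W such that X @ W approximates Y.

    X: (n, d) domain Word2Vec vectors for words shared with the PTLM vocab.
    Y: (n, d) corresponding wordpiece vectors from the general-domain PTLM.
    Solves the orthogonal Procrustes problem min_W ||XW - Y||_F via SVD.
    """
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

# Toy check: if X is a rotated copy of Y, alignment recovers the rotation.
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 16))                  # stand-in "PTLM" vectors
Q = np.linalg.qr(rng.normal(size=(16, 16)))[0]  # hidden orthogonal rotation
X = Y @ Q.T                                     # stand-in "Word2Vec" vectors
W = align_embeddings(X, Y)
print(np.allclose(X @ W, Y, atol=1e-6))         # aligned vectors match targets
```

Once aligned, the domain word vectors live in the same space as the PTLM's input embeddings and can be added to its vocabulary without any further pretraining, which is where the cost savings over full domain-adaptive pretraining (as in BioBERT) come from.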
