Paper Title
Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling
Paper Authors
Paper Abstract
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling (xSL) tasks, such as cross-lingual machine reading comprehension (xMRC), by transferring knowledge from a high-resource language to low-resource languages. Despite this great success, we draw an empirical observation that there is a training objective gap between the pre-training and fine-tuning stages: e.g., the masked language modeling objective requires local understanding of the masked token, whereas the span-extraction objective requires global understanding and reasoning over the input passage/paragraph and question, leading to a discrepancy between pre-training and xMRC. In this paper, we first design a pre-training task tailored for xSL, named Cross-lingual Language Informative Span Masking (CLISM), to eliminate this objective gap in a self-supervised manner. Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of input parallel sequences via unsupervised cross-lingual instance-wise training signals during pre-training. By these means, our method not only bridges the gap between pre-training and fine-tuning, but also enhances PLMs to better capture the alignment between different languages. Extensive experiments show that our method achieves clearly superior results on multiple xSL benchmarks with limited pre-training data. Our method also surpasses previous state-of-the-art methods by a large margin in few-shot settings, where only a few hundred training examples are available.
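The abstract describes CACR only at a high level. As a rough, illustrative sketch (not the authors' implementation), the Python snippet below shows one common way a contrastive consistency regularizer over parallel sequences can be written: a symmetric InfoNCE loss that pulls together pooled representations of a sentence and its translation while treating other in-batch pairs as negatives. The pooling choice, temperature value, and use of in-batch negatives are assumptions for illustration and need not match the paper's exact formulation.

```python
# Minimal sketch of a contrastive consistency regularizer for parallel sequences,
# in the spirit of CACR as summarized in the abstract. Hyperparameters and the
# pooling strategy are illustrative assumptions, not the paper's specification.
import torch
import torch.nn.functional as F


def contrastive_consistency_loss(src_repr: torch.Tensor,
                                 tgt_repr: torch.Tensor,
                                 temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE loss over parallel sequence representations.

    src_repr, tgt_repr: (batch_size, hidden_dim) pooled sentence vectors,
    where src_repr[i] and tgt_repr[i] are assumed to be translations of
    each other; all other in-batch pairs act as negatives.
    """
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    # Cosine-similarity matrix between every source and every target in the batch.
    logits = src @ tgt.t() / temperature                  # (batch, batch)
    labels = torch.arange(src.size(0), device=src.device)
    # Each source should match its own translation, and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    # Toy usage with random vectors standing in for pooled xPLM outputs.
    src = torch.randn(8, 768)
    tgt = torch.randn(8, 768)
    print(contrastive_consistency_loss(src, tgt).item())
```

In an actual pre-training setup, the two inputs would be pooled encoder outputs for a sentence pair drawn from parallel corpora, and this regularizer would be added to the masked-span objective (CLISM) rather than used on its own.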