Paper Title

Instance Regularization for Discriminative Language Model Pre-training

Paper Authors

Zhuosheng Zhang, Hai Zhao, Ming Zhou

Paper Abstract

Discriminative pre-trained language models (PrLMs) can be generalized as denoising auto-encoders that work with two procedures, ennoising and denoising. First, an ennoising process corrupts texts with arbitrary noising functions to construct training instances. Then, a denoising language model is trained to restore the corrupted tokens. Existing studies have made progress by optimizing independent strategies of either ennoising or denoising. They treat training instances equally throughout the training process, with little attention on the individual contribution of those instances. To model explicit signals of instance contribution, this work proposes to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training. The estimations involve the corruption degree in the ennoising data construction process and the prediction confidence in the denoising counterpart. Experimental results on natural language understanding and reading comprehension benchmarks show that our approach improves pre-training efficiency, effectiveness, and robustness. Code is publicly available at https://github.com/cooelf/InstanceReg.
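
Below is a minimal, illustrative PyTorch sketch of the idea described in the abstract: weighting each pre-training instance by an estimate of how hard it is to restore, combining the corruption degree from the ennoising step with the model's prediction confidence from the denoising step. It is a paraphrase under assumptions, not the paper's exact formulation; the function name `instance_weighted_denoising_loss` and the specific way the two signals are combined are hypothetical (see the released code at the link above for the actual method).

```python
# Illustrative sketch only: per-instance weighting of the denoising loss by
# (i) corruption degree from ennoising and (ii) denoising prediction confidence.
import torch
import torch.nn.functional as F


def instance_weighted_denoising_loss(logits, labels, corrupted_mask, ignore_index=-100):
    """
    logits:         (batch, seq_len, vocab) predictions from the denoising LM head
    labels:         (batch, seq_len) original token ids at corrupted positions,
                    `ignore_index` elsewhere
    corrupted_mask: (batch, seq_len) bool, True where the ennoising step corrupted a token
    """
    batch, seq_len, vocab = logits.shape

    # Per-token cross-entropy, non-zero only at corrupted (labeled) positions.
    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab), labels.reshape(-1),
        ignore_index=ignore_index, reduction="none",
    ).reshape(batch, seq_len)

    num_corrupted = corrupted_mask.sum(dim=1).clamp(min=1)

    # (i) Corruption degree: fraction of tokens corrupted in each instance.
    corruption_degree = corrupted_mask.float().mean(dim=1)

    # (ii) Prediction confidence: mean probability assigned to the original
    # tokens at corrupted positions (detached, so it only reweights the loss).
    with torch.no_grad():
        probs = logits.softmax(dim=-1)
        gold = labels.clamp(min=0)  # avoid invalid gather indices at ignored positions
        gold_prob = probs.gather(-1, gold.unsqueeze(-1)).squeeze(-1)
        confidence = (gold_prob * corrupted_mask).sum(dim=1) / num_corrupted

    # Harder instances (heavier corruption, lower confidence) get larger weights;
    # normalize so the average weight within the batch stays close to 1.
    hardness = corruption_degree + (1.0 - confidence)
    weights = hardness / hardness.mean().clamp(min=1e-8)

    per_instance_loss = (token_loss * corrupted_mask).sum(dim=1) / num_corrupted
    return (weights * per_instance_loss).mean()
```

In this sketch, instances that are more heavily corrupted or that the model restores with lower confidence contribute more to the training loss, which is one plausible way to realize the instance-level regularization the abstract describes.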
