Paper Title
Learning from Mistakes: Using Mis-predictions as Harm Alerts in Language Pre-Training
Paper Authors
Paper Abstract
Fitting complex patterns in the training data, such as reasoning and commonsense, is a key challenge for language pre-training. According to recent studies and our empirical observations, one possible reason is that some easy-to-fit patterns in the training data, such as frequently co-occurring word combinations, dominate and harm pre-training, making it hard for the model to fit more complex information. We argue that mis-predictions can help locate such dominating patterns that harm language understanding. When a mis-prediction occurs, there are likely frequently co-occurring patterns involving the mis-predicted word that the model has fitted and that lead to the mis-prediction. If we can add regularization that trains the model to rely less on such dominating patterns when a mis-prediction occurs and to focus more on the remaining, more subtle patterns, more information can be fitted efficiently during pre-training. Following this motivation, we propose a new language pre-training method, Mis-Predictions as Harm Alerts (MPA). In MPA, when a mis-prediction occurs during pre-training, we use its co-occurrence information to guide several heads of the self-attention modules. These self-attention heads in the Transformer modules are optimized to assign lower attention weights to the words in the input sentence that frequently co-occur with the mis-prediction, while assigning higher weights to the other words. By doing so, the Transformer model is trained to rely less on the dominating patterns that frequently co-occur with mis-predictions and to focus more on the remaining, more complex information when mis-predictions occur. Our experiments show that MPA expedites the pre-training of BERT and ELECTRA and improves their performance on downstream tasks.
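The abstract does not give implementation details, but the core idea (pushing guided attention heads away from tokens that frequently co-occur with a mis-predicted word) can be illustrated with a short sketch. Below is a minimal PyTorch sketch of such an attention-guidance loss; the function name, tensor layout, and the use of a pre-normalized co-occurrence matrix and a KL penalty are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of an MPA-style attention regularizer (assumed, not the
# authors' code): at positions where the model mis-predicts, penalize a
# guided attention head for attending to tokens that frequently co-occur
# with the mis-predicted word.
import torch
import torch.nn.functional as F


def mpa_attention_loss(guided_head_attn: torch.Tensor,
                       input_ids: torch.Tensor,
                       mispred_ids: torch.Tensor,
                       mispred_mask: torch.Tensor,
                       cooccurrence_matrix: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss for one guided self-attention head.

    guided_head_attn:     (batch, seq, seq) attention weights, rows = queries.
    input_ids:            (batch, seq) token ids of the input sentence.
    mispred_ids:          (batch, seq) the model's (possibly wrong) predictions.
    mispred_mask:         (batch, seq) 1.0 where a mis-prediction occurred.
    cooccurrence_matrix:  (vocab, vocab) corpus co-occurrence statistics,
                          assumed pre-normalized to [0, 1].
    """
    # Co-occurrence score between each predicted token (query position) and
    # every token in the input sentence (key position): (batch, seq, seq).
    cooc = cooccurrence_matrix[mispred_ids.unsqueeze(-1), input_ids.unsqueeze(1)]

    # Target distribution: lower weight for frequently co-occurring tokens,
    # higher weight for the rest (softmax over the negated co-occurrence).
    target = F.softmax(-cooc, dim=-1)

    # KL divergence between the guided head's attention and the target,
    # counted only at query positions where a mis-prediction occurred.
    kl = F.kl_div(guided_head_attn.clamp_min(1e-9).log(), target,
                  reduction="none").sum(-1)
    return (kl * mispred_mask).sum() / mispred_mask.sum().clamp_min(1.0)


if __name__ == "__main__":
    # Toy shapes only, to show how the pieces fit together.
    vocab, batch, seq = 100, 2, 8
    attn = torch.softmax(torch.randn(batch, seq, seq), dim=-1)
    inputs = torch.randint(vocab, (batch, seq))
    preds = torch.randint(vocab, (batch, seq))
    mask = (torch.rand(batch, seq) < 0.15).float()
    cooc = torch.rand(vocab, vocab)
    print(mpa_attention_loss(attn, inputs, preds, mask, cooc))
```

In a full pre-training setup, a term like this would presumably be added, with some weight, to the usual masked-language-modeling or replaced-token-detection loss, and applied only to the subset of heads chosen to be guided.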