Paper Title

Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models

Authors

Wenbiao Li, Rui Sun, Yunfang Wu

Abstract

Most of the Chinese pre-trained models adopt characters as basic units for downstream tasks. However, these models ignore the information carried by words and thus lose some important semantics. In this paper, we propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models. Specifically, we project a word's embedding into its internal characters' embeddings according to the similarity weight. To strengthen the word boundary information, we mix the representations of the internal characters within a word. After that, we apply a word-to-character alignment attention mechanism to emphasize important characters by masking unimportant ones. Moreover, in order to reduce the error propagation caused by word segmentation, we present an ensemble approach to combine segmentation results given by different tokenizers. The experimental results show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks: sentiment classification, sentence pair matching, natural language inference and machine reading comprehension. We conduct further analysis to demonstrate the effectiveness of each component of our model.
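
The abstract outlines the core mechanism: project a word's embedding into its internal characters' embeddings with similarity weights, then mix the internal characters to mark word boundaries. Below is a minimal PyTorch sketch of the projection-and-mixing step only. The function name `inject_word_semantics`, the `(start, end)` span format, additive fusion, and mean-pooled mixing are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def inject_word_semantics(char_hidden: torch.Tensor,
                          word_emb: torch.Tensor,
                          char_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Hypothetical sketch of similarity-weighted word-to-character projection.

    char_hidden: (seq_len, d) character hidden states from the encoder
    word_emb:    (n_words, d) embeddings of the segmented words
    char_spans:  one (start, end) character index range per word
    """
    enriched = char_hidden.clone()
    for w, (start, end) in enumerate(char_spans):
        chars = char_hidden[start:end]                 # (k, d) internal characters
        # Similarity weights between the word and each of its internal characters.
        sim = F.softmax(chars @ word_emb[w], dim=0)    # (k,)
        # Project the word embedding into each character, scaled by its weight.
        projected = sim.unsqueeze(-1) * word_emb[w]    # (k, d)
        # Mix the internal characters (mean pooling here, as one plausible choice)
        # so every character in the word carries shared boundary information.
        mixed = (chars + projected).mean(dim=0, keepdim=True)  # (1, d)
        enriched[start:end] = chars + projected + mixed
    return enriched
```

As a usage sketch, for the sentence segmented into spans `[(0, 2), (2, 4)]` one would call `inject_word_semantics(hidden, word_emb, [(0, 2), (2, 4)])` on the encoder output. The word-to-character alignment attention and the multi-tokenizer segmentation ensemble described in the abstract would sit on top of this step and are not shown here.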
