Paper Title
Pretraining Chinese BERT for Detecting Word Insertion and Deletion Errors
Paper Authors
Paper Abstract
Chinese BERT models have achieved remarkable progress in handling grammatical errors of word substitution. However, they fail to handle word insertion and deletion because BERT assumes the existence of a word at each position. To address this, we present a simple and effective Chinese pretrained model. The basic idea is to enable the model to determine whether a word exists at a particular position. We achieve this by introducing a special token \texttt{[null]}, the prediction of which stands for the non-existence of a word. In the training stage, we design pretraining tasks such that the model learns to predict \texttt{[null]} and real words jointly given the surrounding context. In the inference stage, the model readily detects whether a word should be inserted or deleted with the standard masked language modeling function. We further create an evaluation dataset to foster research on word insertion and deletion. It includes human-annotated corrections for 7,726 erroneous sentences. Results show that existing Chinese BERT models perform poorly at detecting insertion and deletion errors. Our approach significantly improves the F1 score from 24.1\% to 78.1\% for word insertion and from 26.5\% to 68.5\% for word deletion.
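The inference procedure described in the abstract (masked language modeling over a vocabulary that contains \texttt{[null]}) can be illustrated with a short sketch. The code below is not the authors' released implementation; it assumes a HuggingFace-style BertForMaskedLM checkpoint whose vocabulary already includes the special [null] token, and the checkpoint path is a hypothetical placeholder.

```python
# Illustrative sketch: detecting word insertion/deletion errors with a masked LM
# whose vocabulary contains a special [null] token (hypothetical checkpoint path).
import torch
from transformers import BertTokenizer, BertForMaskedLM

CHECKPOINT = "path/to/null-aware-chinese-bert"  # placeholder, not a real model name
tokenizer = BertTokenizer.from_pretrained(CHECKPOINT)
model = BertForMaskedLM.from_pretrained(CHECKPOINT)
model.eval()

# Assumes [null] was added to the vocabulary during pretraining.
NULL_ID = tokenizer.convert_tokens_to_ids("[null]")


def top_prediction(tokens, mask_pos):
    """Mask the token at mask_pos and return the id the MLM head ranks highest."""
    ids = tokenizer.convert_tokens_to_ids(tokens)
    ids[mask_pos] = tokenizer.mask_token_id
    input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(ids)])
    with torch.no_grad():
        logits = model(input_ids).logits
    # +1 offsets the [CLS] token prepended by build_inputs_with_special_tokens.
    return logits[0, mask_pos + 1].argmax().item()


def missing_word_positions(tokens):
    """Insertion check: probe a virtual slot between characters; if the model
    predicts a real word rather than [null], a word is likely missing there."""
    positions = []
    for i in range(len(tokens) + 1):
        candidate = tokens[:i] + ["[MASK]"] + tokens[i:]
        if top_prediction(candidate, i) != NULL_ID:
            positions.append(i)
    return positions


def redundant_word_positions(tokens):
    """Deletion check: mask each existing character; if the model predicts
    [null] for that slot, the character is likely redundant."""
    return [i for i in range(len(tokens))
            if top_prediction(list(tokens), i) == NULL_ID]
```

In this sketch, `tokens` would typically be the character-level pieces returned by `tokenizer.tokenize(sentence)`; the two helper functions only illustrate how a single [null] prediction can signal both error types under the stated assumptions.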