SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check

Paper Title

SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check

Authors

Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, Yuan Qi

Abstract

Chinese Spelling Check (CSC) is the task of detecting and correcting spelling errors in Chinese natural language. Existing methods have attempted to incorporate similarity knowledge between Chinese characters, but they treat that knowledge either as an external input resource or as mere heuristic rules. This paper proposes to incorporate phonological and visual similarity knowledge into language models for CSC via a specialized graph convolutional network (SpellGCN). The model builds a graph over the characters, and SpellGCN learns to map this graph into a set of inter-dependent character classifiers. These classifiers are applied to the representations extracted by another network, such as BERT, so that the whole network is end-to-end trainable. Experiments are conducted on three human-annotated datasets, and the method outperforms previous models by a large margin. The dataset and all code for this paper are available at https://github.com/ACL2020SpellGCN/SpellGCN.
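
The idea in the abstract, a graph over characters that is mapped into inter-dependent character classifiers applied to token representations, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single similarity graph, one standard graph-convolution layer with symmetric normalization, a three-character toy vocabulary, identity matrices for the initial features and the layer weight, and a hand-written vector standing in for a BERT representation. All names (`normalize_adjacency`, `gcn_layer`, `classify`, `VOCAB`) are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(norm_adj, features, weight):
    """One graph convolution: ReLU(norm_adj @ features @ weight)."""
    out = matmul(matmul(norm_adj, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

def classify(token_vec, classifier_rows):
    """Score a token vector against every character classifier (dot product)."""
    logits = [sum(w * x for w, x in zip(row, token_vec))
              for row in classifier_rows]
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, logits

# Toy vocabulary (assumption): the first two characters are treated as
# visually similar, the third is unrelated to both.
VOCAB = ["己", "已", "天"]
ADJ = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

norm_adj = normalize_adjacency(ADJ)
# Each row is the classifier weight vector for one character. Rows of
# connected characters are mixed together, so the classifiers are
# inter-dependent rather than learned in isolation.
classifiers = gcn_layer(norm_adj, IDENTITY, IDENTITY)

# Stand-in for the representation a sentence encoder would produce
# for one token (assumption: hand-written numbers, not real BERT output).
token_vec = [0.2, 0.1, 0.9]
pred_index, logits = classify(token_vec, classifiers)
print(VOCAB[pred_index], logits)
```

In this toy setup the two connected characters end up with identical classifier rows, which is the extreme case of similarity information flowing along graph edges. A trained model would use learned features and weights, so similar characters would receive related but distinct classifiers.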
