Chinese Lexical Simplification

Paper title

Chinese Lexical Simplification

Authors

Jipeng Qiang, Xinyu Lu, Yun Li, Yunhao Yuan, Yang Shi, Xindong Wu

Abstract

Lexical simplification, the process of replacing complex words in a given sentence with simpler alternatives of equivalent meaning, has attracted much attention in many languages. Although the richness of the Chinese vocabulary makes text very difficult to read for children and non-native speakers, there has been no research on the Chinese lexical simplification (CLS) task. To circumvent difficulties in acquiring annotations, we manually create the first benchmark dataset for CLS, which can be used to evaluate lexical simplification systems automatically. For a more thorough comparison, we present five types of methods as baselines for generating substitute candidates for the complex word: a synonym-based approach, a word embedding-based approach, a pretrained language model-based approach, a sememe-based approach, and a hybrid approach. Finally, we design an experimental evaluation of these baselines and discuss their advantages and disadvantages. To the best of our knowledge, this is the first study of the CLS task.
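To make the task concrete, the following is a minimal sketch of what a synonym-based baseline might look like: look up candidate substitutes for the complex word, then keep the most frequent one as the "simplest". It is not the authors' implementation. The `SYNONYMS` and `FREQUENCY` tables, their values, and the `simplify` function are hypothetical placeholders, and the sentence is assumed to be already segmented into words.

```python
# Toy synonym-based lexical simplification (illustrative sketch only).
# SYNONYMS and FREQUENCY hold made-up placeholder data; a real system would
# use a synonym dictionary and corpus frequency counts.
SYNONYMS = {
    "斟酌": ["考虑", "思量"],
}
FREQUENCY = {"考虑": 900, "思量": 40, "斟酌": 30}


def simplify(tokens, complex_word):
    """Replace complex_word with its most frequent synonym.

    The sentence is returned unchanged when no synonym is known or when
    no synonym is more frequent than the complex word itself.
    """
    candidates = SYNONYMS.get(complex_word, [])
    # Word frequency is used here as a rough proxy for simplicity.
    best = max(candidates, key=lambda w: FREQUENCY.get(w, 0), default=None)
    if best is None or FREQUENCY.get(best, 0) <= FREQUENCY.get(complex_word, 0):
        return list(tokens)
    return [best if t == complex_word else t for t in tokens]


sentence = ["他", "在", "斟酌", "这个", "问题"]
simplified = simplify(sentence, "斟酌")
```

Frequency is only one possible ranking signal. The other baseline types listed in the abstract would generate or rank candidates differently, for example with word embeddings or a pretrained language model.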
