Paper Title

Stopwords in Technical Language Processing

Authors

Serhad Sarica, Jianxi Luo

Abstract

Natural language processing techniques are increasingly applied to information retrieval, indexing, and topic modelling in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. Researchers typically use readily available stopword lists derived for general English, but the technical jargon of engineering fields contains its own highly frequent and uninformative words, and no standard stopword list exists for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative data-driven approaches, and by curating a stopword list ready for technical language processing applications.
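
To make the idea of domain-specific stopwords concrete, the following is a minimal Python sketch, not the authors' method. The abstract does not say which data-driven approaches were combined, so the sketch assumes one simple signal, document frequency. The toy corpus, the tiny general stopword set, and the names `candidate_stopwords`, `remove_stopwords`, and `min_doc_ratio` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus of patent-like sentences (not data from the paper).
docs = [
    "the present invention relates to a method for cooling a battery pack",
    "the method according to the present invention comprises a heat sink",
    "a gear assembly according to the present embodiment drives the rotor",
]

# Tiny stand-in for a general English stopword list (assumption).
GENERAL_STOPWORDS = {"the", "a", "to", "for", "of", "an"}


def candidate_stopwords(documents, general, min_doc_ratio=0.6):
    """Return words outside the general list that appear in at least
    min_doc_ratio of the documents (a document-frequency heuristic)."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document.
        doc_freq.update(set(text.split()))
    threshold = min_doc_ratio * len(documents)
    return sorted(
        word for word, n in doc_freq.items()
        if n >= threshold and word not in general
    )


def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token found in the stopword set."""
    return [word for word in text.split() if word not in stopwords]


technical = candidate_stopwords(docs, GENERAL_STOPWORDS)
all_stopwords = GENERAL_STOPWORDS | set(technical)
tokens = remove_stopwords(docs[0], all_stopwords)
```

On this toy corpus, words such as "present" and "invention" are frequent across documents yet missing from the general list, which is the kind of gap the abstract describes. A real pipeline would need a proper tokenizer, a much larger corpus, and manual review of the candidates.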
