Accuracy of the Uzbek stop words detection: a case study on "School corpus"

Paper title

Accuracy of the Uzbek stop words detection: a case study on "School corpus"

Authors

Madatov, Khabibulla; Bekchanov, Shukurla; Vičič, Jernej

Abstract

Stop words are important for information retrieval and text analysis tasks in natural language processing. This work presents a method for evaluating the quality of stop word lists produced by automatic creation techniques. Although the method was tested on an automatically generated list of stop words for the Uzbek language, it can be applied, with some modifications, to similar languages, either from the same family or with an agglutinative nature. Because Uzbek is an agglutinative language, automatic detection of stop words is a more complex process than in inflected languages. We also build on our previous work on stop word detection, which used the "School corpus" as an example, by investigating how the detection of stop words in Uzbek texts can be analysed automatically. The work addresses two questions: whether there is a good way to evaluate the available stop words for Uzbek texts, and whether the part of an Uzbek sentence that contains the majority of stop words can be determined by studying the numerical characteristics of the probability of unique words. The results show acceptable accuracy of the stop word lists.
