A Nepali Rule Based Stemmer and its performance on different NLP applications

Paper Title

A Nepali Rule Based Stemmer and its performance on different NLP applications

Authors

Koirala, Pravesh; Shakya, Aman

Abstract

Stemming is an integral part of Natural Language Processing (NLP) and a preprocessing step in almost every NLP application. Arguably, its most important use is in Information Retrieval (IR). While much work has been done on stemming for languages like English, only a few works address Nepali stemming. This study focuses on creating a rule-based stemmer for Nepali text. Specifically, it is an affix-stripping system that identifies two different classes of suffixes in Nepali grammar and strips them separately. Only a single negativity prefix (Na) is identified and stripped. The study applies a number of techniques, such as exception word identification, morphological normalization, and word transformation, to increase stemming performance. The stemmer is tested intrinsically using Paice's method and extrinsically on a basic tf-idf based IR system and an elementary news topic classifier using a Multinomial Naive Bayes classifier. The difference in performance of these systems with and without the stemmer is analysed.
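
To make the affix-stripping idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The suffix lists, the grouping into two classes (case markers and a plural marker), the exception list, the minimum stem length, and the function names are all assumptions chosen for illustration; the abstract does not specify the paper's actual rules.

```python
import unicodedata

# Illustrative assumptions only, not the paper's rule set.
CASE_SUFFIXES = ("लाई", "बाट", "ले", "को", "मा")  # assumed class 1: case markers
PLURAL_SUFFIXES = ("हरू",)                         # assumed class 2: plural marker
NEGATIVE_PREFIX = "न"                              # the "Na" negativity prefix
EXCEPTIONS = {"सीमा"}   # hypothetical exception list: words returned unchanged
MIN_STEM_LEN = 2         # assumed guard against stripping very short words


def strip_suffix(word, suffixes):
    """Strip the longest matching suffix, if enough of the word remains."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def strip_negative_prefix(word):
    """Strip the prefix only when it is a bare consonant, not part of a syllable."""
    if not word.startswith(NEGATIVE_PREFIX):
        return word
    rest = word[len(NEGATIVE_PREFIX):]
    if len(rest) < MIN_STEM_LEN:
        return word
    # A following combining mark (vowel sign or virama) means the first
    # character belongs to a larger syllable, so leave the word alone.
    if unicodedata.category(rest[0]).startswith("M"):
        return word
    return rest


def stem(word):
    """Exception check, then each suffix class separately, then the prefix."""
    if word in EXCEPTIONS:
        return word
    word = strip_suffix(word, CASE_SUFFIXES)
    word = strip_suffix(word, PLURAL_SUFFIXES)
    return strip_negative_prefix(word)


if __name__ == "__main__":
    for w in ("घरहरूमा", "केटाले", "नगर्नु", "नेपाल", "सीमा"):
        print(w, "->", stem(w))
```

Stripping the two suffix classes in separate passes lets a stacked form lose its case marker first and its plural marker second. The exception list shows why a purely mechanical rule set needs an escape hatch for words that merely end in a suffix-like string. A real system would also need the morphological normalization and word transformation steps the abstract mentions, which this sketch leaves out.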
