Paper Title
Accelerating Text Mining Using Domain-Specific Stop Word Lists
Paper Authors
Paper Abstract
Text preprocessing is an essential step in text mining. Removing words that can negatively impact the quality of prediction algorithms, or that are not informative enough, is a crucial storage-saving technique in text indexing and improves computational efficiency. Typically, a generic stop word list is applied to a dataset regardless of the domain. However, many words that are common within one domain, and therefore carry little significance there, differ from domain to domain. Eliminating domain-specific common words from a corpus reduces the dimensionality of the feature space and improves the performance of text mining tasks. In this paper, we present a novel mathematical approach, called the hyperplane-based approach, for the automatic extraction of domain-specific words. This new approach relies on low-dimensional representations of words in a vector space and the notion of their distance from a hyperplane. The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features. We compare the hyperplane-based approach with other feature selection methods, namely chi-square (χ2) and mutual information. An experimental study is performed on three different datasets and five classification algorithms, measuring both the dimensionality reduction and the increase in classification performance. Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information. The computational time needed to identify the domain-specific words is significantly lower than that of mutual information.
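The abstract does not specify how the hyperplane is constructed, so as a minimal sketch of the distance-from-hyperplane idea, assume each word has a low-dimensional embedding and that a hyperplane (normal vector `w`, offset `b`) has been obtained, e.g. from a linear classifier trained on the corpus. Words whose embeddings lie close to the hyperplane carry little discriminative information and become candidate domain-specific stop words. The word list, embeddings, hyperplane, and threshold below are all illustrative, not taken from the paper:

```python
import numpy as np

def distance_to_hyperplane(X, w, b):
    """Perpendicular distance of each word vector (row of X)
    to the hyperplane w . x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

# Toy 2-D embeddings for five hypothetical words.
words = ["patient", "the", "dosage", "of", "trial"]
X = np.array([[ 2.0,   1.5 ],
              [ 0.1,  -0.05],
              [-1.8,   2.2 ],
              [-0.02,  0.1 ],
              [ 1.4,  -2.0 ]])
w = np.array([0.6, 0.8])   # assumed hyperplane normal
b = 0.0                    # assumed offset

d = distance_to_hyperplane(X, w, b)

# Flag words lying close to the hyperplane (threshold is illustrative).
stop_candidates = [word for word, dist in zip(words, d) if dist < 0.5]
print(stop_candidates)  # → ['the', 'of']
```

In this toy setup, the frequent function words sit near the hyperplane and are flagged, while content-bearing words lie far from it and are kept, which mirrors the dimensionality-reduction behavior the abstract describes.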