Paper Title

Quantitative Stopword Generation for Sentiment Analysis via Recursive and Iterative Deletion

Paper Author

DiPietro, Daniel M.

Paper Abstract

Stopwords carry little semantic information and are often removed from text data to reduce dataset size and improve machine learning model performance. Consequently, researchers have sought to develop techniques for generating effective stopword sets. Previous approaches have ranged from qualitative techniques relying upon linguistic experts, to statistical approaches that extract word importance using correlations or frequency-dependent metrics computed on a corpus. We present a novel quantitative approach that employs iterative and recursive feature deletion algorithms to see which words can be deleted from a pre-trained transformer's vocabulary with the least degradation to its performance, specifically for the task of sentiment analysis. Empirically, stopword lists generated via this approach drastically reduce dataset size while negligibly impacting model performance, in one such example shrinking the corpus by 28.4% while improving the accuracy of a trained logistic regression model by 0.25%. In another instance, the corpus was shrunk by 63.7% with a 2.8% decrease in accuracy. These promising results indicate that our approach can generate highly effective stopword sets for specific NLP tasks.
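The iterative deletion procedure the abstract describes can be sketched as follows. This is a minimal illustration only: it greedily removes the single word whose deletion least degrades held-in accuracy, stopping once any further deletion costs more than a tolerance. A toy count-based sentiment classifier stands in for the paper's pre-trained transformer, the recursive variant is omitted, and all function names and the example corpus are hypothetical.

```python
from collections import Counter

def train_weights(docs, labels, vocab):
    """Toy sentiment model: weight = (# positive docs containing the word)
    minus (# negative docs containing the word)."""
    pos, neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        for word in set(doc.split()):
            if word in vocab:
                (pos if y == 1 else neg)[word] += 1
    return {word: pos[word] - neg[word] for word in vocab}

def accuracy(docs, labels, vocab):
    """Retrain on the restricted vocabulary and score on the same docs."""
    weights = train_weights(docs, labels, vocab)
    correct = 0
    for doc, y in zip(docs, labels):
        score = sum(weights.get(w, 0) for w in doc.split() if w in vocab)
        correct += int((1 if score > 0 else 0) == y)
    return correct / len(docs)

def iterative_stopword_deletion(docs, labels, vocab, tol=0.0):
    """Greedily delete the word whose removal degrades accuracy the least,
    stopping once every remaining deletion costs more than `tol`."""
    vocab = set(vocab)
    baseline = accuracy(docs, labels, vocab)
    stopwords = []
    while vocab:
        best_acc, best_word = max(
            (accuracy(docs, labels, vocab - {w}), w) for w in sorted(vocab)
        )
        if best_acc + tol < baseline:
            break  # every remaining word is informative
        vocab.discard(best_word)
        stopwords.append(best_word)
    return stopwords

# Hypothetical toy corpus for illustration (1 = positive, 0 = negative).
docs = ["the movie was good", "the movie was bad",
        "a great film", "a terrible film"]
labels = [1, 0, 1, 0]
vocab = {w for d in docs for w in d.split()}
stopwords = iterative_stopword_deletion(docs, labels, vocab)
```

On this toy corpus the procedure discards every word whose deletion leaves accuracy intact, keeping only the tokens the classifier actually relies on; the paper applies the same idea at scale, re-evaluating a transformer rather than retraining a count model.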
