Paper Title

Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set

Authors

Arthur M. Jacobs, Annette Kinder

Abstract

The Gutenberg Literary English Corpus (GLEC) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics. However, so far only a small subcorpus, the Gutenberg English Poetry Corpus, has been submitted to quantitative text analyses providing predictions for scientific studies of literature. Here we show that, in the entire GLEC, quasi error-free text classification and authorship recognition are possible with a method using the same set of five style and five content features, computed via style and sentiment analysis, in both tasks. Our results identify two standard and two novel features (i.e., type-token ratio, frequency, sonority score, surprise) as most diagnostic in these tasks. By providing a simple tool, applicable to both short poems and long novels, that generates quantitative predictions about features that co-determine the cognitive and affective processing of specific text categories or authors, our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
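
As an illustration of the simplest of the features named in the abstract, the type-token ratio can be sketched as the number of distinct word forms divided by the total number of words. The tokenization rule and the function name `type_token_ratio` below are assumptions made for this sketch and are not taken from the paper, whose exact preprocessing is not described here.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by total words (tokens)."""
    # Assumption: lowercase the text and treat runs of letters and
    # apostrophes as tokens; the paper's tokenizer may differ.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


sample = "the cat sat on the mat and the dog sat too"
ttr = type_token_ratio(sample)  # 8 types over 11 tokens
```

A raw ratio of this kind tends to fall as texts get longer, so comparing short poems with long novels would likely need some length normalization; how the authors handle this is not stated in the abstract.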
