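Algorithmic labeling in hierarchical classifications of publications: Evaluation of bibliographic fields and term weighting approaches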

Paper title

Algorithmic labeling in hierarchical classifications of publications: Evaluation of bibliographic fields and term weighting approaches

Authors

Peter Sjögårde, Per Ahlgren, Ludo Waltman

Abstract

Algorithmic classifications of research publications can be used to study many different aspects of the science system, such as the organization of science into fields, the growth of fields, interdisciplinarity, and emerging topics. How to label the classes in these classifications is a problem that has not been thoroughly addressed in the literature. In this study we evaluate different approaches to label the classes in algorithmically constructed classifications of research publications. We focus on two important choices: the choice of (1) different bibliographic fields and (2) different approaches to weight the relevance of terms. To evaluate the different choices, we created two baselines: one based on the Medical Subject Headings in MEDLINE and another based on the Science-Metrix journal classification. We tested to what extent different approaches yield the desired labels for the classes in the two baselines. Based on our results we recommend extracting terms from titles and keywords to label classes at high levels of granularity (e.g. topics). At low levels of granularity (e.g. disciplines) we recommend extracting terms from journal names and author addresses. We recommend the use of a new approach, term frequency to specificity ratio, to calculate the relevance of terms.
