Paper Title
Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews
Paper Authors
Abstract
Objective: Systematic reviews of scholarly documents often provide complete and exhaustive summaries of the literature relevant to a research question. However, well-conducted systematic reviews are expensive, time-consuming, and labor-intensive. Here, we propose an automatic document classification approach that significantly reduces the effort of reviewing documents. Methods: We first describe a manual document classification procedure used to curate a pertinent training dataset, and then propose three classifiers: a keyword-guided method, a refined method based on cluster analysis, and a random forest approach that uses a large set of feature tokens. As an example, the approach is applied to identify documents studying female sex workers that are assumed to contain content relevant to either HIV or violence. We compare the performance of the three classifiers by cross-validation and conduct a sensitivity analysis on the portion of the data used to train the model. Results: The random forest approach provides the highest area under the curve (AUC) for both the receiver operating characteristic (ROC) curve and the precision/recall (PR) curve. Analyses of precision and recall suggest that, with the random forest, manually reviewing only 20% of the articles could capture 80% of the relevant cases. Finally, we found that a good classifier can be obtained with a relatively small training sample. Conclusions: In sum, the automated document classification procedure presented here could improve both the precision and the efficiency of systematic reviews, and could facilitate live reviews, in which reviews are updated regularly.
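The trade-off reported in the Results (reviewing a fraction of the articles while still capturing most of the relevant ones) can be made concrete with a small calculation. The following is a minimal sketch, not the authors' code: it assumes each document already has a relevance score from some classifier, ranks documents by that score, and reports the recall and precision obtained if only the top-scoring fraction is reviewed manually. The function name `recall_precision_at_fraction` and the toy scores and labels are hypothetical and invented purely for illustration.

```python
from typing import Sequence, Tuple


def recall_precision_at_fraction(
    scores: Sequence[float],
    labels: Sequence[int],
    fraction: float,
) -> Tuple[float, float]:
    """Return (recall, precision) when only the top-scoring `fraction`
    of documents is reviewed manually.

    `labels` holds 1 for a relevant document and 0 otherwise.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("labels contain no relevant documents")

    # Rank documents from highest to lowest classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Number of documents the reviewer reads (at least one).
    n_review = max(1, int(round(fraction * len(ranked))))
    # Relevant documents that fall inside the reviewed portion.
    found = sum(label for _, label in ranked[:n_review])
    return found / total_relevant, found / n_review


# Toy data (hypothetical, for illustration only): 10 documents, 5 relevant.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

recall, precision = recall_precision_at_fraction(toy_scores, toy_labels, 0.5)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

On this toy data, reviewing the top half of the ranking recovers 4 of the 5 relevant documents. Sweeping `fraction` from small to large values traces out the kind of review-effort versus recall curve that the abstract summarizes with its 20% and 80% figures.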