Paper Title
A Self-supervised Approach for Semantic Indexing in the Context of COVID-19 Pandemic
Paper Authors
Paper Abstract
The pandemic has accelerated the pace at which COVID-19 scientific papers are published. In addition, the process of manually assigning semantic indexes to these papers by experts is even more time-consuming and overwhelming in the current health crisis. Therefore, there is an urgent need for automatic semantic indexing models that can effectively scale up to newly introduced concepts and the rapidly evolving distributions of the hyper-focused related literature. In this study, we present a novel semantic indexing approach based on state-of-the-art self-supervised representation learning and transformer encoding, suited specifically to pandemic crises. We present a case study on a novel dataset of COVID-19 papers published and manually indexed in PubMed. Our study shows that our self-supervised model outperforms the best-performing models of BioASQ Task 8a by a micro-F1 score of 0.1 and an LCA-F score of 0.08 on average. Our model also shows superior performance in detecting supplementary concepts, which is particularly important when the focus of the literature has drastically shifted towards specific concepts related to the pandemic. Our study sheds light on the main challenges confronting semantic indexing models during a pandemic, namely new domains and drastic shifts in their distributions, and, as a superior alternative for such situations, proposes a model founded on approaches that have shown promising performance in improving generalization and data efficiency across various NLP tasks. We also show that jointly indexing major Medical Subject Headings (MeSH) and supplementary concepts improves overall performance.
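
The abstract does not specify the architecture, but semantic indexing with a transformer encoder is typically framed as multi-label classification over MeSH headings and supplementary concepts, evaluated with micro-F1 as cited above. The sketch below is illustrative only, not the authors' implementation: the checkpoint (bert-base-uncased as a stand-in for a biomedical encoder), the toy label space, and the 0.5 decision threshold are all assumptions for demonstration.

    # Minimal sketch: a pretrained transformer encoder produces a document
    # representation, and a sigmoid output layer jointly predicts MeSH
    # headings and supplementary concepts (multi-label classification).
    import torch
    from torch import nn
    from transformers import AutoModel, AutoTokenizer
    from sklearn.metrics import f1_score

    PRETRAINED = "bert-base-uncased"  # placeholder; a biomedical encoder would be used in practice
    NUM_LABELS = 4                    # toy label space: a few candidate MeSH/supplementary concepts

    class MeshIndexer(nn.Module):
        def __init__(self, pretrained: str, num_labels: int):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(pretrained)
            self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]   # [CLS] token as the document vector
            return self.classifier(cls)         # one logit per candidate concept

    tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
    model = MeshIndexer(PRETRAINED, NUM_LABELS)

    batch = tokenizer(
        ["Remdesivir in patients hospitalized with severe COVID-19 pneumonia."],
        padding=True, truncation=True, return_tensors="pt",
    )
    logits = model(batch["input_ids"], batch["attention_mask"])

    # Multi-label training uses an independent sigmoid per label (BCE loss),
    # not a softmax over the label set.
    targets = torch.tensor([[1.0, 0.0, 1.0, 0.0]])  # toy gold indexing
    loss = nn.BCEWithLogitsLoss()(logits, targets)

    # At inference, concepts whose sigmoid probability exceeds a threshold are
    # assigned; micro-F1 over these assignments is the reported metric.
    preds = (torch.sigmoid(logits) > 0.5).int()
    micro_f1 = f1_score(targets.int().numpy(), preds.numpy(), average="micro", zero_division=0)

The joint prediction of major headings and supplementary concepts in a single output layer mirrors the joint-indexing setting whose benefit the abstract reports; hierarchy-aware metrics such as LCA-F additionally require the MeSH tree structure and are not shown here.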