Paper Title
Improving Clinical Document Understanding on COVID-19 Research with Spark NLP
Authors
Abstract
Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literature review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types, including social determinants of health, anatomy, risk factors, and adverse events, in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection, which distinguishes between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than those previously available: the system leverages an integrated pipeline of state-of-the-art pretrained named entity recognition models and improves on the previous best-performing benchmarks for assertion status detection. We illustrate extracting trends and insights from the COVID-19 Open Research Dataset (CORD-19), such as the most frequent disorders and symptoms and the most common vital signs and EKG findings. The system is built using the Spark NLP library, which natively supports scaling to distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare-specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.
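As a rough illustration of the last analysis step described in the abstract (ranking the most frequent symptoms or disorders once entities and their assertion statuses have been extracted), the following is a minimal sketch in plain Python. It does not use the Spark NLP API. The `Mention` record, the `most_frequent` helper, the label names, and the toy mentions are all hypothetical and stand in for the output of an upstream entity recognition and assertion status pipeline.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """Hypothetical record for one extracted entity mention."""
    text: str         # surface form found in the document
    entity_type: str  # e.g. "Symptom" or "Disease" (illustrative labels)
    assertion: str    # e.g. "present", "absent", "conditional", "someone_else"


def most_frequent(
    mentions: List[Mention],
    entity_type: str,
    assertion: str = "present",
    top_n: int = 3,
) -> List[Tuple[str, int]]:
    """Count mentions of one entity type, keeping only one assertion status."""
    counts = Counter(
        m.text.lower()  # case-fold so "Fever" and "fever" are counted together
        for m in mentions
        if m.entity_type == entity_type and m.assertion == assertion
    )
    return counts.most_common(top_n)


# Toy, made-up mentions; a real run would take these from the pipeline output.
mentions = [
    Mention("fever", "Symptom", "present"),
    Mention("Fever", "Symptom", "present"),
    Mention("cough", "Symptom", "present"),
    Mention("cough", "Symptom", "absent"),          # negated, so excluded
    Mention("pneumonia", "Disease", "present"),
    Mention("diabetes", "Disease", "someone_else"),  # about a relative, excluded
]

top_symptoms = most_frequent(mentions, "Symptom")
print(top_symptoms)
```

Filtering on assertion status before counting is what keeps negated findings or findings about a family member from inflating the frequency of a symptom or disorder.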