Paper Title

Mining Adverse Drug Reactions from Unstructured Mediums at Scale

Authors

Hasham Ul Haq, Veysel Kocaman, David Talby

Abstract

Adverse drug reactions/events (ADR/ADE) have a major impact on patient health and health care costs. Detecting ADRs as early as possible and sharing them with regulators, pharma companies, and healthcare providers can prevent morbidity and save many lives. While most ADRs are not reported via formal channels, they are often documented in a variety of unstructured conversations such as social media posts by patients, customer support call transcripts, or CRM notes of meetings between healthcare providers and pharma sales reps. In this paper, we propose a natural language processing (NLP) solution that detects ADRs in such unstructured free-text conversations and improves on previous work in three ways. First, a new Named Entity Recognition (NER) model obtains new state-of-the-art accuracy for ADR and Drug entity extraction on the ADE, CADEC, and SMM4H benchmark datasets (91.75%, 78.76%, and 83.41% F1 scores, respectively). Second, two new Relation Extraction (RE) models, one based on BioBERT and the other using crafted features over a Fully Connected Neural Network (FCNN), are shown to perform on par with existing state-of-the-art models and to outperform them when trained with a supplementary clinician-annotated RE dataset. Third, a new text classification model, which decides whether a conversation includes an ADR, obtains new state-of-the-art accuracy on the CADEC dataset (86.69% F1 score). The complete solution is implemented as a unified NLP pipeline in a production-grade library built on top of Apache Spark, making it natively scalable and able to process millions of batch or streaming records on commodity clusters.
