Paper title

Automated data extraction of bar chart raster images

Authors

Carderas, Alex; Yuan, Ye; Livnat, Itamar; Yanagihara, Ryan; Saul, Rosita; De Oca, Gabrielle Montes; Zheng, Kai; Browne, Andrew W.

Abstract

Objective: To develop software that uses optical character recognition to automatically extract data from bar charts for meta-analysis. Methods: We used a multistep data extraction approach that included figure extraction, text detection, and image disassembly. The PubMed Central papers processed in this manner were clinical trials on macular degeneration, a blinding disease with a heavy disease burden and many clinical trials. Bar chart characteristics were extracted both automatically and manually, and the two approaches were compared for accuracy and then with a Bland-Altman analysis. Results: Based on the Bland-Altman analysis, 91.8% of data points were within the limits of agreement. Compared with manual data extraction, automated data extraction yielded the following accuracies: X-axis labels 79.5%, Y-tick values 88.6%, Y-axis label 88.6%, bar values with <5% error 88.0%. Discussion: Based on our analysis, we achieved agreement between automated and manual data extraction. A major source of error was the optical character recognition library misreading 7s as 2s. The method would also benefit from redundancy checks in the form of a deep neural network to boost bar detection accuracy. Further refinements to this method are justified to extract tabulated and line graph data to facilitate automated data gathering for meta-analysis.
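The comparison described in the abstract can be illustrated with a minimal sketch. It assumes the conventional Bland-Altman limits of agreement (mean difference ± 1.96 sample standard deviations) and a relative-error definition of the "<5% error" criterion; the abstract does not state the exact formulas used. The function names `bland_altman` and `fraction_within_relative_error` and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def bland_altman(auto, manual):
    """Return (bias, lower limit, upper limit, fraction of points within limits).

    Assumes the conventional limits of agreement: bias +/- 1.96 * SD of the
    paired differences. This is an assumption, not the paper's stated method.
    """
    if len(auto) != len(manual) or len(auto) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, lower, upper, within


def fraction_within_relative_error(auto, manual, tol=0.05):
    """Fraction of automated values within `tol` relative error of the manual ones."""
    hits = sum(abs(a - m) <= tol * abs(m) for a, m in zip(auto, manual))
    return hits / len(manual)


# Hypothetical bar values: automated extraction vs. manual reading.
auto_values = [10.0, 20.5, 30.0, 41.0]
manual_values = [10.0, 20.0, 30.0, 40.0]
bias, lower, upper, within = bland_altman(auto_values, manual_values)
accuracy = fraction_within_relative_error(auto_values, manual_values)
```

A digit misread such as 27 for 22 gives a relative error well above 5%, so it would count against the bar-value accuracy under this definition.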
