Paper Title

Differences between human and machine perception in medical diagnosis

Authors

Makino, Taro; Jastrzebski, Stanislaw; Oleszkiewicz, Witold; Chacko, Celin; Ehrenpreis, Robin; Samreen, Naziya; Chhor, Chloe; Kim, Eric; Lee, Jiyon; Pysarenko, Kristine; Reig, Beatriu; Toth, Hildegard; Awal, Divya; Du, Linda; Kim, Alice; Park, James; Sodickson, Daniel K.; Heacock, Laura; Moy, Linda; Cho, Kyunghyun; Geras, Krzysztof J.

Abstract

Deep neural networks (DNNs) show promise in image-based medical diagnosis, but cannot be fully trusted since their performance can be severely degraded by dataset shifts to which human perception remains invariant. If we can better understand the differences between human and machine perception, we can potentially characterize and mitigate this effect. We therefore propose a framework for comparing human and machine perception in medical diagnosis. The two are compared with respect to their sensitivity to the removal of clinically meaningful information, and to the regions of an image deemed most suspicious. Drawing inspiration from the natural image domain, we frame both comparisons in terms of perturbation robustness. The novelty of our framework is that separate analyses are performed for subgroups with clinically meaningful differences. We argue that this is necessary in order to avert Simpson's paradox and draw correct conclusions. We demonstrate our framework with a case study in breast cancer screening, and reveal significant differences between radiologists and DNNs. We compare the two with respect to their robustness to Gaussian low-pass filtering, performing a subgroup analysis on microcalcifications and soft tissue lesions. For microcalcifications, DNNs use a separate set of high frequency components than radiologists, some of which lie outside the image regions considered most suspicious by radiologists. These features run the risk of being spurious, but if not, could represent potential new biomarkers. For soft tissue lesions, the divergence between radiologists and DNNs is even starker, with DNNs relying heavily on spurious high frequency components ignored by radiologists. Importantly, this deviation in soft tissue lesions was only observable through subgroup analysis, which highlights the importance of incorporating medical domain knowledge into our comparison framework.
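
The abstract compares radiologists and DNNs by their robustness to Gaussian low-pass filtering. As a rough illustration of what that perturbation removes, the sketch below blurs a toy image and measures how much of a coarse and a fine sinusoidal pattern survives. It is not the authors' implementation: the function names (`gaussian_kernel`, `low_pass`, `peak_amplitude`), the synthetic patterns, the wrap-around boundary handling, and the value of `sigma` are all assumptions chosen for illustration.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalized 1-D Gaussian kernel (weights sum to 1)."""
    if radius is None:
        radius = max(1, int(math.ceil(3 * sigma)))  # truncate at about 3 sigma
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_rows(image, sigma):
    """Convolve every row with the kernel, wrapping around at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    blurred = []
    for row in image:
        n = len(row)
        blurred.append([
            sum(kernel[j + radius] * row[(i + j) % n]
                for j in range(-radius, radius + 1))
            for i in range(n)
        ])
    return blurred


def low_pass(image, sigma):
    """Separable 2-D Gaussian low-pass filter: blur rows, then columns."""
    rows_done = _blur_rows(image, sigma)
    transposed = [list(col) for col in zip(*rows_done)]
    cols_done = _blur_rows(transposed, sigma)
    return [list(col) for col in zip(*cols_done)]


def peak_amplitude(image):
    """Largest absolute pixel value in the image."""
    return max(abs(v) for row in image for v in row)


# Hypothetical 32x32 toy patterns (not mammography data):
# a coarse pattern with one cycle across the width, and a fine
# pattern that repeats every 4 pixels.
SIZE = 32
coarse = [[math.sin(2 * math.pi * x / SIZE) for x in range(SIZE)]
          for _ in range(SIZE)]
fine = [[math.sin(2 * math.pi * x / 4) for x in range(SIZE)]
        for _ in range(SIZE)]

SIGMA = 2.0  # filter severity in pixels, an arbitrary illustrative choice
print("coarse pattern retained:", peak_amplitude(low_pass(coarse, SIGMA)))
print("fine pattern retained:  ", peak_amplitude(low_pass(fine, SIGMA)))
```

In this toy setting the coarse pattern passes through almost unchanged while the fine pattern is nearly erased, which is why a predictor whose accuracy drops under such filtering can be said to depend on high frequency components. Repeating the measurement over several values of `sigma`, separately for each clinically meaningful subgroup, would mirror the kind of comparison the abstract describes.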
