Paper Title
Weakly Supervised Learning with Automated Labels from Radiology Reports for Glioma Change Detection
Paper Authors
Paper Abstract
Gliomas are the most frequent primary brain tumors in adults. Glioma change detection aims at finding the relevant parts of the image that change over time. Although Deep Learning (DL) shows promising performance in similar change detection tasks, the creation of large annotated datasets represents a major bottleneck for supervised DL applications in radiology. To overcome this, we propose a combined use of weak labels (imprecise, but fast-to-create annotations) and Transfer Learning (TL). Specifically, we explore inductive TL, where source and target domains are identical, but tasks are different due to a label shift: our target labels are created manually by three radiologists, whereas our source weak labels are generated automatically from radiology reports via NLP. We frame knowledge transfer as hyperparameter optimization, thus avoiding the heuristic choices that are frequent in related works. We investigate the relationship between model size and TL, comparing a low-capacity VGG with a higher-capacity ResNeXt model. We evaluate our models on 1693 T2-weighted magnetic resonance imaging difference maps created from 183 patients, classifying them as stable or unstable according to tumor evolution. The weak labels extracted from radiology reports allowed us to increase dataset size more than 3-fold, and to improve VGG classification results from 75% to 82% AUC. Mixed training from scratch led to higher performance than fine-tuning or feature extraction. To assess generalizability, we ran inference on an open dataset (BraTS-2015: 15 patients, 51 difference maps), reaching up to 76% AUC. Overall, results suggest that medical imaging problems may benefit from smaller models and from different TL strategies than those commonly used for computer vision datasets, and that report-generated weak labels are effective in improving model performance. Code, the in-house dataset, and BraTS labels are released.
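As a concrete illustration of the two ingredients the abstract describes, the sketch below shows (i) how a longitudinal difference map could be computed from two co-registered, intensity-normalized T2-weighted volumes, and (ii) a minimal keyword-based weak labeler over report text. Both are assumptions for illustration only: the abstract does not specify the normalization scheme, the registration step, or the NLP pipeline, and all function names and keyword lists here are hypothetical.

```python
import numpy as np


def zscore_normalize(volume: np.ndarray) -> np.ndarray:
    """Intensity-normalize a volume (hypothetical choice: z-scoring)."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)


def difference_map(t2_prior: np.ndarray, t2_current: np.ndarray) -> np.ndarray:
    """Voxel-wise difference of two co-registered T2-weighted volumes.

    Assumes both scans were already registered to a common space;
    registration itself is out of scope for this sketch.
    """
    return zscore_normalize(t2_current) - zscore_normalize(t2_prior)


# Hypothetical keyword cues; the paper's actual NLP labeler is not specified here.
UNSTABLE_CUES = ("progression", "increase", "new lesion", "growth")
STABLE_CUES = ("stable", "no change", "unchanged", "no progression")


def weak_label_from_report(report: str) -> str:
    """Assign a weak 'stable'/'unstable' label from radiology report text.

    A deliberately naive keyword matcher standing in for the report-based
    labeling step; imprecise by design, as weak labels are allowed to be.
    """
    text = report.lower()
    if any(cue in text for cue in UNSTABLE_CUES):
        return "unstable"
    if any(cue in text for cue in STABLE_CUES):
        return "stable"
    return "unknown"  # reports without clear cues could be discarded


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prior = rng.normal(size=(64, 64, 32))  # stand-in for a T2 volume
    current = prior + rng.normal(scale=0.1, size=prior.shape)
    dmap = difference_map(prior, current)
    print(dmap.shape, weak_label_from_report("Stable disease, no change."))
```

In this framing, the difference maps would be the model inputs, the keyword-derived labels the weak source labels, and the radiologist annotations the target labels; how the actual pipeline implements each step is detailed in the paper itself, not here.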