Paper Title
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
Paper Authors
Paper Abstract
Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude below the general images and captions from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients probably carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning thus scaling the usable training data in a combinatorial magnitude with low cost. We also propose to replace the InfoNCE loss with semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We prove that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using around 200K data). Our code is available at https://github.com/RyanWangZf/MedCLIP.
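The abstract's two key ideas, decoupling images and texts so any image can be contrasted against any report, and replacing InfoNCE's hard one-to-one targets with soft targets derived from medical knowledge, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see https://github.com/RyanWangZf/MedCLIP for that); the function name, the multi-hot entity labels, and the temperature value below are illustrative assumptions.

```python
# Minimal sketch of a knowledge-driven semantic matching loss for decoupled
# image/text batches. All names and defaults are illustrative assumptions,
# not taken from the MedCLIP codebase.
import torch
import torch.nn.functional as F

def semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels, temperature=0.07):
    """Contrast decoupled images and texts against soft, knowledge-based targets.

    img_emb:    (N, d) image embeddings
    txt_emb:    (M, d) text embeddings (texts need not be paired with the images)
    img_labels: (N, K) multi-hot medical entities associated with each image
    txt_labels: (M, K) multi-hot medical entities extracted from each report
    """
    # Cosine-similarity logits between every image and every text.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (N, M)

    # Soft targets from medical-entity similarity: an image and a report from
    # different patients that share the same findings get a high target score
    # instead of being treated as a hard (false) negative.
    sim = F.normalize(img_labels.float(), dim=-1) @ F.normalize(txt_labels.float(), dim=-1).t()
    targets_i2t = sim / sim.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    targets_t2i = sim.t() / sim.t().sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # Soft cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = -(targets_i2t * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2i = -(targets_t2i * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the targets come from label similarity rather than pairing, the image batch and the text batch can be sampled independently, which is what lets the usable training pairs grow combinatorially from a small set of images and reports.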