Paper Title

UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering

Paper Authors

Chenlu Zhan, Peng Peng, Hongsen Wang, Tao Chen, Hongwei Wang

Paper Abstract

Medical Visual Question Answering (Medical-VQA) aims to answer clinical questions about radiology images, assisting doctors with decision-making. However, current Medical-VQA models learn cross-modal representations with vision and text encoders that reside in two separate spaces, which leads to indirect semantic alignment. In this paper, we propose UnICLAM, a Unified and Interpretable Medical-VQA model built on Contrastive Representation Learning with Adversarial Masking. Specifically, to learn an aligned image-text representation, we first establish a unified dual-stream pre-training structure with a gradually soft parameter-sharing strategy. Technically, the proposed strategy constrains the vision and text encoders to stay close in the same space, and the constraint is gradually loosened at higher layers. Moreover, to grasp a unified semantic representation, we extend adversarial masking data augmentation to the contrastive representation learning of vision and text in a unified manner. Concretely, while encoder training minimizes the distance between original and masked samples, the adversarial masking module is trained adversarially to maximize that distance. Furthermore, we take a further exploration of the unified adversarial masking augmentation model, which improves the potential ante-hoc interpretability with remarkable performance and efficiency. Experimental results on the VQA-RAD and SLAKE public benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA models. More importantly, we additionally discuss the performance of UnICLAM in diagnosing heart failure, verifying that UnICLAM exhibits superior few-shot adaptation performance in practical disease diagnosis.
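To make the gradually soft parameter-sharing strategy concrete, here is a minimal PyTorch sketch of one plausible realization: an L2 coupling penalty between corresponding vision and text encoder layers whose weight decays with depth, so lower layers are tied tightly while higher layers are progressively freed. The exponential schedule, the `base_weight`/`decay` values, and the per-layer pairing are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn as nn

def soft_sharing_penalty(vision_layers, text_layers, base_weight=1.0, decay=0.5):
    """Gradually soft parameter sharing (illustrative): penalize the L2
    distance between corresponding vision/text layers, with a coupling
    weight that shrinks at higher layers so the constraint loosens with depth."""
    penalty = 0.0
    for depth, (v_layer, t_layer) in enumerate(zip(vision_layers, text_layers)):
        coupling = base_weight * (decay ** depth)  # smaller weight at higher layers
        for v_p, t_p in zip(v_layer.parameters(), t_layer.parameters()):
            if v_p.shape == t_p.shape:  # couple only shape-compatible parameters
                penalty = penalty + coupling * (v_p - t_p).pow(2).sum()
    return penalty

# Toy usage: two stacks of same-shaped layers standing in for the two encoders.
vision = nn.ModuleList([nn.Linear(768, 768) for _ in range(6)])
text = nn.ModuleList([nn.Linear(768, 768) for _ in range(6)])
reg = soft_sharing_penalty(vision, text)  # added to the task loss during training
```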
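The min-max game between the encoders and the adversarial masking module can be sketched in the same spirit. In the (assumed) InfoNCE-style form below, the encoder step pulls original and masked views together while the masker step pushes them apart; `encoder` and `masker` are hypothetical modules, and the soft elementwise mask and alternating update schedule are illustrative choices rather than the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def contrastive_distance(z_orig, z_masked, temperature=0.07):
    """InfoNCE-style alignment loss between original and masked views:
    small when matched pairs are close, large when they drift apart."""
    z_orig = F.normalize(z_orig, dim=-1)
    z_masked = F.normalize(z_masked, dim=-1)
    logits = z_orig @ z_masked.t() / temperature
    targets = torch.arange(z_orig.size(0), device=z_orig.device)
    return F.cross_entropy(logits, targets)

def adversarial_masking_step(encoder, masker, batch, enc_opt, mask_opt):
    """One alternating round of the min-max game; `masker` is assumed to
    output a soft mask with the same shape as `batch`."""
    # Masker step: maximize the original-vs-masked distance
    # (i.e., minimize its negative).
    mask = masker(batch)
    loss_mask = -contrastive_distance(encoder(batch), encoder(batch * mask))
    mask_opt.zero_grad()
    loss_mask.backward()
    mask_opt.step()
    # Encoder step: minimize the same distance on a detached mask,
    # so only the encoders receive gradients here.
    mask = masker(batch).detach()
    loss_enc = contrastive_distance(encoder(batch), encoder(batch * mask))
    enc_opt.zero_grad()
    loss_enc.backward()
    enc_opt.step()
    return loss_enc.item()
```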
