Paper Title
A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
Paper Authors
Paper Abstract
Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis. MVQA is a task that aims to predict accurate and convincing answers based on given medical images and associated natural language questions. This task requires extracting medical knowledge-rich feature content and developing a fine-grained understanding of it. Therefore, constructing an effective feature extraction and understanding scheme is key to modeling. Existing MVQA question extraction schemes mainly focus on word information, ignoring medical information in the text. Meanwhile, some visual and textual feature understanding schemes cannot effectively capture the correlation between regions and keywords for reasonable visual reasoning. In this study, a dual-attention learning network with word and sentence embedding (WSDAN) is proposed. We design a module, transformer with sentence embedding (TSE), to extract a double embedding representation of questions containing keywords and medical information. A dual-attention learning (DAL) module consisting of self-attention and guided attention is proposed to model intensive intramodal and intermodal interactions. With multiple stacked DAL modules (DALs), learning visual and textual co-attention increases the granularity of understanding and improves visual reasoning. Experimental results on the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate that our proposed method outperforms previous state-of-the-art methods. According to the ablation studies and Grad-CAM maps, WSDAN can extract rich textual information and has strong visual reasoning ability.
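To make the dual-attention idea concrete, the following is a minimal PyTorch sketch of what one DAL block could look like, assuming standard multi-head attention layers: self-attention models intramodal interactions within each modality, and guided attention lets image-region features attend to question features for intermodal interaction. The class name DALBlock, the feature dimensions, and the layer arrangement are illustrative assumptions, not the authors' exact implementation; the TSE double embedding of the question is likewise not shown here.

```python
import torch
import torch.nn as nn

class DALBlock(nn.Module):
    """Hypothetical sketch of one dual-attention learning (DAL) block:
    intramodal self-attention for each modality, followed by
    question-guided attention over image regions (intermodal)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Intramodal self-attention for textual (question) features.
        self.text_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Intramodal self-attention for visual (region) features.
        self.img_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Intermodal guided attention: visual queries attend to textual keys/values.
        self.guided_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # txt_feats: (B, L_t, dim) question tokens; img_feats: (B, L_v, dim) image regions.
        t, _ = self.text_self_attn(txt_feats, txt_feats, txt_feats)
        txt_feats = self.norm_t(txt_feats + t)
        v, _ = self.img_self_attn(img_feats, img_feats, img_feats)
        img_feats = self.norm_v(img_feats + v)
        # Question-guided attention over image regions.
        g, _ = self.guided_attn(img_feats, txt_feats, txt_feats)
        img_feats = self.norm_g(img_feats + g)
        return img_feats, txt_feats


# Stacking several DAL blocks (DALs) refines visual-textual co-attention step by step.
blocks = nn.ModuleList([DALBlock() for _ in range(4)])
img = torch.randn(2, 49, 512)   # e.g. 7x7 grid of region features
txt = torch.randn(2, 20, 512)   # e.g. 20 question token embeddings
for blk in blocks:
    img, txt = blk(img, txt)
```

Stacking the blocks in this way is one plausible reading of "multiple DAL modules": each pass re-attends over the refined features, which is the mechanism the abstract credits with increasing the granularity of understanding.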