Paper Title

Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Paper Authors

Zihao Zhu, Jing Yu, Yujing Wang, Yajing Sun, Yue Hu, Qi Wu

Paper Abstract

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond visible content to answer questions about an image, which is challenging but indispensable to achieve general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noises for reasoning the final answer. How to capture the question-oriented and information-complementary evidence remains a key challenge to solve the problem. In this paper, we depict an image by a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to the visual, semantic and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question. Specifically, the intra-modal graph convolution selects evidence from each modality and cross-modal graph convolution aggregates relevant information across different modalities. By stacking this process multiple times, our model performs iterative reasoning and predicts the optimal answer by analyzing all question-oriented evidence. We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments.
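
The abstract describes the architecture only at a high level. As a rough illustration, the sketch below shows what one reasoning step of stacked intra-modal and cross-modal graph convolutions might look like in PyTorch. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes fully connected graphs within each modality layer, a simple concatenation-based attention for question guidance, and that cross-modal convolution fuses visual and semantic evidence into the fact layer. All module names, dimensions, and the attention form are hypothetical.

```python
# Illustrative sketch only -- not the authors' released code. Each modality
# layer (visual / semantic / fact) is assumed to be a fully connected graph
# represented as a matrix of node features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntraModalConv(nn.Module):
    """Question-guided graph convolution within one modality layer."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1)   # scores each node's relevance to the question
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nodes, question):
        # nodes: (N, dim); question: (dim,)
        q = question.expand(nodes.size(0), -1)
        alpha = torch.softmax(self.attn(torch.cat([nodes, q], -1)).squeeze(-1), 0)
        context = (alpha.unsqueeze(-1) * nodes).sum(0, keepdim=True)  # (1, dim)
        return F.relu(self.proj(torch.cat([nodes, context.expand_as(nodes)], -1)))


class CrossModalConv(nn.Module):
    """Aggregate complementary evidence from another modality's nodes."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, dst, src):
        # dst: (Nd, dim) target-layer nodes; src: (Ns, dim) source-layer nodes
        pairs = torch.cat([
            dst.unsqueeze(1).expand(-1, src.size(0), -1),
            src.unsqueeze(0).expand(dst.size(0), -1, -1)], -1)
        alpha = torch.softmax(self.gate(pairs).squeeze(-1), dim=1)  # (Nd, Ns)
        agg = alpha @ src                                           # (Nd, dim)
        return F.relu(self.proj(torch.cat([dst, agg], -1)))


class MuckoStep(nn.Module):
    """One reasoning step: intra-modal evidence selection per layer,
    then cross-modal fusion into the fact layer (an assumption about
    how the layers are wired, based on the abstract)."""
    def __init__(self, dim):
        super().__init__()
        self.intra = nn.ModuleDict({m: IntraModalConv(dim)
                                    for m in ("visual", "semantic", "fact")})
        self.vis2fact = CrossModalConv(dim)
        self.sem2fact = CrossModalConv(dim)

    def forward(self, layers, question):
        layers = {m: self.intra[m](x, question) for m, x in layers.items()}
        layers["fact"] = self.vis2fact(layers["fact"], layers["visual"])
        layers["fact"] = self.sem2fact(layers["fact"], layers["semantic"])
        return layers


# Usage: stacking the step multiple times corresponds to the iterative
# reasoning the abstract describes (sizes here are arbitrary).
dim = 64
step = MuckoStep(dim)
layers = {m: torch.randn(5, dim) for m in ("visual", "semantic", "fact")}
question = torch.randn(dim)
for _ in range(2):
    layers = step(layers, question)
```

After the final step, the fact-layer node features would presumably be scored against candidate answers; that readout is not specified in the abstract and is omitted here.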
