Paper Title
MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension
Paper Authors
Paper Abstract
Referring expression comprehension (REC) aims to localize a text-related region in a given image according to a referring expression in natural language. Existing methods focus on how to build convincing visual and language representations independently, which may significantly isolate visual and language information. In this paper, we argue that for REC the referring expression and the target region are semantically correlated, and that subject, location, and relationship consistency exists between vision and language. On top of this, we propose a novel approach called MutAtt to construct mutual guidance between vision and language, which treats vision and language equally and thus yields compact information matching. Specifically, for each of the subject, location, and relationship modules, MutAtt builds two kinds of attention-based mutual guidance strategies. One strategy generates vision-guided language embeddings to match the relevant visual features. The other, in reverse, generates language-guided visual features to match the relevant language embeddings. This mutual guidance strategy effectively guarantees vision-language consistency in all three modules. Experiments on three popular REC datasets demonstrate that the proposed approach outperforms current state-of-the-art methods.
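To make the two guidance directions concrete, here is a minimal PyTorch sketch of one attention-based mutual guidance block. All names, layer sizes, and the cosine-similarity matching score are illustrative assumptions rather than the paper's actual implementation; the paper applies such guidance separately within each of its subject, location, and relationship modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualGuidanceBlock(nn.Module):
    """Hypothetical sketch of one attention-based mutual guidance block.

    Dimensions, projections, and the matching score below are assumptions
    made for illustration; the paper's actual architecture may differ.
    """

    def __init__(self, vis_dim: int, lang_dim: int, hidden_dim: int):
        super().__init__()
        # Project both modalities into a shared space for attention scoring.
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)

    def forward(self, vis_feats, word_feats):
        # vis_feats:  (B, R, vis_dim)  region-level visual features
        # word_feats: (B, T, lang_dim) word-level language features
        v = self.vis_proj(vis_feats)    # (B, R, H)
        w = self.lang_proj(word_feats)  # (B, T, H)

        # Vision-guided language embedding: each region attends over the
        # words, producing a language embedding tailored to that region.
        attn_v2w = F.softmax(v @ w.transpose(1, 2), dim=-1)  # (B, R, T)
        vis_guided_lang = attn_v2w @ w                       # (B, R, H)

        # Language-guided visual feature: each word attends over the
        # regions, producing a visual feature tailored to that word.
        attn_w2v = F.softmax(w @ v.transpose(1, 2), dim=-1)  # (B, T, R)
        lang_guided_vis = attn_w2v @ v                       # (B, T, H)

        # One plausible matching score: cosine similarity between each
        # region's projected visual feature and its vision-guided
        # language embedding.
        region_scores = F.cosine_similarity(vis_guided_lang, v, dim=-1)  # (B, R)
        return vis_guided_lang, lang_guided_vis, region_scores


if __name__ == "__main__":
    block = MutualGuidanceBlock(vis_dim=2048, lang_dim=512, hidden_dim=512)
    vis = torch.randn(2, 10, 2048)    # 2 images, 10 candidate regions each
    words = torch.randn(2, 7, 512)    # a 7-word referring expression
    _, _, scores = block(vis, words)
    print(scores.shape)               # torch.Size([2, 10]) — one score per region
```

Because each direction scores the other modality's features, neither vision nor language acts solely as the query: this symmetry is what the abstract means by treating vision and language equally.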