Paper Title

Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Authors

Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, Gao Huang

Abstract

Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding. The most effective approaches for this task are based on deep learning, which generally require expensive, manually labeled image-query or patch-query pairs. To eliminate the heavy dependence on human annotations, we present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects in unlabeled images, and then language queries for these objects are obtained in an unsupervised fashion with a pseudo-query generation module. We then design a task-related query prompt module to specifically tailor the generated pseudo language queries for visual grounding tasks. Further, in order to fully capture the contextual relationships between images and language queries, we develop a visual-language model equipped with a multi-level cross-modal attention mechanism. Extensive experimental results demonstrate that our method has two notable benefits: (1) it can reduce human annotation costs significantly, e.g., by 31% on RefCOCO without degrading the original model's performance under the fully supervised setting, and (2) without bells and whistles, it achieves superior or comparable performance to state-of-the-art weakly supervised visual grounding methods on all five datasets we experimented on. Code is available at https://github.com/LeapLabTHU/Pseudo-Q.
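To make the pipeline concrete, the sketch below illustrates the core idea of pseudo-query generation: take an object detector's output (category, attribute, bounding box) and compose a natural-language query from a noun, an attribute, and a coarse spatial relation. This is a minimal illustration under assumed detection formats and a hypothetical query template, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch (not the authors' implementation) of pseudo-query
# generation: compose a query from a detected object's attribute,
# category noun, and a spatial relation derived from its box position.
# The detection dict format and the query template are illustrative
# assumptions.

def spatial_relation(box, image_width):
    """Coarse horizontal position of a box given as (x1, y1, x2, y2)."""
    center_x = (box[0] + box[2]) / 2
    if center_x < image_width / 3:
        return "left"
    if center_x > 2 * image_width / 3:
        return "right"
    return "middle"

def generate_pseudo_query(detection, image_width):
    """Build one pseudo language query from a single detector output."""
    parts = []
    if detection.get("attribute"):          # e.g. a color predicted by the detector
        parts.append(detection["attribute"])
    parts.append(detection["category"])     # the object noun
    parts.append("on the " + spatial_relation(detection["box"], image_width))
    return " ".join(parts)

# Example: an off-the-shelf detector might yield
det = {"category": "dog", "attribute": "brown", "box": (20, 40, 120, 200)}
print(generate_pseudo_query(det, image_width=640))  # prints "brown dog on the left"
```

In the paper, such raw pseudo queries are further refined by the task-related query prompt module before training the grounding model.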
