CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Paper title

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Authors

Wang, Haoran; He, Dongliang; Wu, Wenhao; Xia, Boyang; Yang, Min; Li, Fu; Yu, Yunlong; Ji, Zhong; Ding, Errui; Wang, Jingdong

Abstract

Image-Text Retrieval (ITR) is challenging because it must bridge the visual and linguistic modalities. Contrastive learning has been adopted by most prior work. Besides the limited number of negative image-text pairs, the capability of contrastive learning is restricted by manual weighting of negative pairs and by unawareness of external knowledge. In this paper, we propose a novel Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) method for improving cross-modal representation. First, we devise a novel diversity-sensitive contrastive learning (DCL) architecture. We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity sensitivity is achieved by adaptive negative pair weighting. Furthermore, CODER is designed with two branches. One learns instance-level embeddings from image/text, and it also generates pseudo online clustering labels for its input image/text based on their embeddings. Meanwhile, the other branch learns to query a commonsense knowledge graph to form concept-level descriptors for both modalities. Afterwards, both branches leverage DCL to align the cross-modal embedding spaces, while an extra pseudo clustering label prediction loss is used to promote concept-level representation learning for the second branch. Extensive experiments on two popular benchmarks, i.e., MSCOCO and Flickr30K, show that CODER remarkably outperforms state-of-the-art approaches.
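
The abstract names three ingredients without giving formulas: a momentum-updated encoder, a dynamic dictionary (queue) of negatives, and adaptive weighting of negative pairs. The following is a minimal NumPy sketch of how such pieces typically fit together in a MoCo-style setup. It is not the authors' implementation. The function names, the temperature `tau`, and in particular the softmax-based weighting controlled by `beta` are assumptions made for illustration; the paper's actual weighting scheme is not specified here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def momentum_update(online, momentum_params, m=0.999):
    """Exponential moving average of the online encoder's parameters."""
    return {k: m * momentum_params[k] + (1.0 - m) * online[k] for k in online}

def enqueue(queue, keys):
    """Dynamic dictionary as a FIFO: append new keys, drop the oldest ones."""
    return np.concatenate([queue, keys], axis=0)[len(keys):]

def weighted_contrastive_loss(queries, pos_keys, queue, tau=0.07, beta=1.0):
    """InfoNCE-style loss with per-negative weights.

    Assumption: weights are a softmax over negative similarities, rescaled
    to mean 1, so harder negatives count more. beta=0 gives uniform weights
    and recovers plain InfoNCE.
    """
    q = l2_normalize(queries)
    k = l2_normalize(pos_keys)
    neg = l2_normalize(queue)
    pos_sim = np.sum(q * k, axis=1)   # (B,) similarity to the matched pair
    neg_sim = q @ neg.T               # (B, K) similarity to queued negatives
    w = np.exp(beta * (neg_sim - neg_sim.max(axis=1, keepdims=True)))
    w = w / w.sum(axis=1, keepdims=True) * neg_sim.shape[1]
    pos_term = np.exp(pos_sim / tau)
    neg_term = np.sum(w * np.exp(neg_sim / tau), axis=1)
    return float(np.mean(-np.log(pos_term / (pos_term + neg_term))))
```

In an image-to-text direction, `queries` would be image embeddings from the online encoder, `pos_keys` the matching text embeddings from the momentum encoder, and `queue` the text embeddings retained from earlier batches; the text-to-image direction swaps the roles.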
