Image-Text Retrieval with Binary and Continuous Label Supervision

Paper Title

Image-Text Retrieval with Binary and Continuous Label Supervision

Authors

Li, Zheng; Guo, Caili; Feng, Zerun; Hwang, Jenq-Neng; Jin, Ying; Zhang, Yufeng

Abstract

Most image-text retrieval work adopts binary labels indicating whether a pair of image and text matches or not. Such a binary indicator covers only a limited subset of image-text semantic relations, which is insufficient to represent relevance degrees between images and texts described by continuous labels such as image captions. The visual-semantic embedding space obtained by learning binary labels is incoherent and cannot fully characterize the relevance degrees. In addition to the use of binary labels, this paper further incorporates continuous pseudo labels (generally approximated by text similarity between captions) to indicate the relevance degrees. To learn a coherent embedding space, we propose an image-text retrieval framework with Binary and Continuous Label Supervision (BCLS), where binary labels are used to guide the retrieval model to learn limited binary correlations, and continuous labels are complementary to the learning of image-text semantic relations. For the learning of binary labels, we improve the common Triplet ranking loss with Soft Negative mining (Triplet-SN) to improve convergence. For the learning of continuous labels, we design Kendall ranking loss inspired by Kendall rank correlation coefficient (Kendall), which improves the correlation between the similarity scores predicted by the retrieval model and the continuous labels. To mitigate the noise introduced by the continuous pseudo labels, we further design Sliding Window sampling and Hard Sample mining strategy (SW-HS) to alleviate the impact of noise and reduce the complexity of our framework to the same order of magnitude as the triplet ranking loss. Extensive experiments on two image-text retrieval benchmarks demonstrate that our method can improve the performance of state-of-the-art image-text retrieval models.
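
The abstract does not give the form of the Kendall ranking loss, so the following is only a minimal sketch of the idea it names: measuring how well the ordering of predicted similarity scores agrees with the ordering of continuous labels. `kendall_tau` computes the standard Kendall tau-a coefficient. `pairwise_logistic_rank_loss` is a hypothetical smooth surrogate chosen for illustration (a logistic penalty on mis-ordered pairs); it is an assumption and not the loss defined in the paper. Neither function covers Triplet-SN or the SW-HS sampling strategy.

```python
import math
from itertools import combinations


def kendall_tau(scores, labels):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(scores)
    if n != len(labels) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        agreement = (scores[i] - scores[j]) * (labels[i] - labels[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
        # ties (agreement == 0) count toward neither group
    return (concordant - discordant) / (n * (n - 1) / 2)


def pairwise_logistic_rank_loss(scores, labels, temperature=1.0):
    """Hypothetical surrogate (not the paper's loss): mean logistic penalty
    over pairs whose labels differ, small when scores follow the label order."""
    total, count = 0.0, 0
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # no preferred order for tied labels
        direction = 1.0 if labels[i] > labels[j] else -1.0
        margin = direction * (scores[i] - scores[j]) / temperature
        total += math.log1p(math.exp(-margin))
        count += 1
    return total / count if count else 0.0


# Illustrative values only: model similarity scores for three candidate
# captions, and pseudo labels such as caption-to-caption text similarity.
predicted = [0.9, 0.1, 0.5]
pseudo_labels = [0.8, 0.4, 0.2]
tau = kendall_tau(predicted, pseudo_labels)
loss = pairwise_logistic_rank_loss(predicted, pseudo_labels)
```

In this example two of the three pairs are ordered consistently with the labels and one is not, so `tau` is 1/3. Both functions enumerate all pairs and are therefore quadratic in the number of candidates; the abstract states that SW-HS is what brings the framework's complexity down to the same order as the triplet ranking loss.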
