Paper Title

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

Authors

Hadi Abdi Khojasteh, Ebrahim Ansari, Parvin Razzaghi, Akbar Karimi

Abstract

This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is a challenging task since the features and representations of text and image are not comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that the images and tweets are not standardized in the same way as the benchmarks. Furthermore, there can be a higher semantic correlation between the pictures and tweets, in contrast to benchmarks in which the descriptions are well organized. Experimental results on the MS-COCO benchmark dataset show that our model outperforms certain previously proposed methods and achieves competitive performance compared to the state-of-the-art. The code and dataset have been made publicly available.
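
The abstract states that matching (positive) and mismatching (negative) image-text pairs are separated with a hinge-based triplet ranking loss. The snippet below is a minimal sketch of such a bidirectional ranking loss over a batch of image and caption embeddings, assuming cosine similarity on L2-normalized vectors and an illustrative margin of 0.2; the function name, margin value, and sum-based reduction are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of a bidirectional hinge-based triplet ranking loss for
# cross-modal retrieval. Positives sit on the diagonal of the image-text
# similarity matrix; all off-diagonal entries act as negatives.
import torch
import torch.nn.functional as F


def triplet_ranking_loss(image_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)

    # scores[i, j] = similarity of image i and caption j.
    scores = image_emb @ text_emb.t()
    positives = scores.diag().view(-1, 1)

    # Hinge terms in both retrieval directions.
    cost_caption = (margin + scores - positives).clamp(min=0)    # image -> text
    cost_image = (margin + scores - positives.t()).clamp(min=0)  # text -> image

    # Do not penalize the matching pairs themselves.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_image = cost_image.masked_fill(mask, 0)

    return cost_caption.sum() + cost_image.sum()


if __name__ == "__main__":
    # Example: a batch of 4 image and 4 caption embeddings of dimension 512.
    imgs = torch.randn(4, 512)
    caps = torch.randn(4, 512)
    print(triplet_ranking_loss(imgs, caps))
```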
