Paper Title

Retrieval-Augmented Transformer for Image Captioning

Paper Authors

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Paper Abstract

Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on text retrieved from the external memory. Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.
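
The abstract describes a kNN-augmented attention layer that predicts tokens conditioned on both the past context and text retrieved from an external memory. The sketch below is a minimal, illustrative PyTorch rendering of that idea and is not the authors' code: the class name KNNAugmentedAttention, the gated fusion of the two attention streams, and all tensor shapes are assumptions made for the example.

```python
# Minimal sketch (assumed design, not the paper's implementation): a decoder layer
# that attends over its past context and over tokens retrieved from an external memory.
import torch
import torch.nn as nn


class KNNAugmentedAttention(nn.Module):
    """Attention over past context plus retrieved memory tokens, fused with a learned gate."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate mixing context-based and memory-based information per position.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor, retrieved_mem: torch.Tensor) -> torch.Tensor:
        # x:             (batch, seq_len, d_model)   states of already-generated tokens
        # retrieved_mem: (batch, k_tokens, d_model)  encoded tokens of the k retrieved captions
        # (Causal masking is omitted to keep the sketch short.)
        ctx, _ = self.self_attn(x, x, x)                          # attend over past context
        mem, _ = self.mem_attn(x, retrieved_mem, retrieved_mem)   # attend over retrieved text
        g = torch.sigmoid(self.gate(torch.cat([ctx, mem], dim=-1)))
        return g * ctx + (1.0 - g) * mem                          # gated fusion of both streams


if __name__ == "__main__":
    layer = KNNAugmentedAttention(d_model=512, n_heads=8)
    x = torch.randn(2, 20, 512)    # decoder states for 20 generated tokens
    mem = torch.randn(2, 60, 512)  # stand-in for encoded tokens of retrieved captions
    print(layer(x, mem).shape)     # torch.Size([2, 20, 512])
```

In the pipeline the abstract outlines, the retrieved tokens would come from captions of visually similar images found in an external corpus and passed through the differentiable encoder; here they are random tensors only to keep the example self-contained.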
