Paper Title
Large-Scale Bidirectional Training for Zero-Shot Image Captioning
Paper Authors
Paper Abstract
When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark comprising high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.