Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Paper Title

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Authors

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross

Abstract

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
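The matching task described in the abstract can be sketched in a few lines of Python. The abstract does not spell out a scoring rule, so this is an illustrative sketch under stated assumptions: each example is scored from a 2x2 table of caption-image similarities produced by some model, and an example counts as solved only when every correct pairing beats its mismatched alternative. The function names `same_words` and `winoground_scores`, the `sim` layout, and the sample captions and similarity values are all hypothetical and chosen for illustration only.

```python
from collections import Counter
from typing import Dict, Sequence


def same_words(caption_a: str, caption_b: str) -> bool:
    """Check the task's core constraint: both captions use exactly the
    same multiset of words, in a different order."""
    words_a = caption_a.lower().split()
    words_b = caption_b.lower().split()
    return Counter(words_a) == Counter(words_b) and words_a != words_b


def winoground_scores(sim: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Score one example from a 2x2 similarity table.

    Assumed layout: sim[c][i] is the model's similarity between caption c
    and image i, where caption 0 belongs with image 0 and caption 1 with
    image 1.
    """
    # Text side: for each image, the correct caption must score higher.
    text_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Image side: for each caption, the correct image must score higher.
    image_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Group: the example is fully solved only if both sides are right.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


if __name__ == "__main__":
    # Illustrative caption pair (hypothetical, not taken from the dataset).
    print(same_words("a mug in some grass", "some grass in a mug"))
    # Illustrative similarity values (hypothetical model outputs).
    print(winoground_scores([[0.9, 0.2], [0.3, 0.8]]))
```

Requiring strict inequalities means a model that assigns identical similarities to both pairings gets no credit, which keeps a word-order-insensitive model from passing by accident.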
