Paper Title
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Paper Authors
Paper Abstract
Providing explanations in the context of Visual Question Answering (VQA) presents a fundamental problem in machine learning. To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations. For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple structured textual explanations which are derived from the original scene graphs. By construction, the CLEVR-X explanations are correct and describe the reasoning and visual information that is necessary to answer a given question. We conducted a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant. We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation generation quality for different question and answer types. Additionally, we study the influence of using different numbers of ground-truth explanations on the convergence of natural language generation (NLG) metrics. The CLEVR-X dataset is publicly available at \url{https://explainableml.github.io/CLEVR-X/}.
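To make the idea of deriving structured textual explanations from scene graphs more concrete, the following minimal Python sketch builds a templated answer and explanation for a toy counting question over a hand-written CLEVR-style scene graph. The scene_graph structure, the explain_count helper, and the example question are illustrative assumptions only and do not reproduce the actual CLEVR-X generation pipeline.

# Hypothetical sketch: answering a toy counting question and producing a
# templated explanation from a CLEVR-style scene graph. All object
# attributes and the question below are invented for illustration.

scene_graph = [
    {"shape": "cube", "color": "red", "size": "large", "material": "metal"},
    {"shape": "sphere", "color": "blue", "size": "small", "material": "rubber"},
    {"shape": "cube", "color": "blue", "size": "small", "material": "rubber"},
]

def explain_count(scene, attribute, value):
    """Return a count answer plus a templated explanation built from the scene graph."""
    matches = [obj for obj in scene if obj[attribute] == value]
    answer = len(matches)
    descriptions = [
        f"a {obj['size']} {obj['color']} {obj['material']} {obj['shape']}"
        for obj in matches
    ]
    explanation = (
        f"There are {answer} {value} objects: " + ", ".join(descriptions) + "."
        if matches
        else f"There are no {value} objects in the scene."
    )
    return answer, explanation

# Example question: "How many blue things are there?"
answer, explanation = explain_count(scene_graph, "color", "blue")
print(answer)       # 2
print(explanation)  # There are 2 blue objects: a small blue rubber sphere, a small blue rubber cube.

Because the explanation is assembled directly from the scene-graph attributes of the matching objects, it is correct by construction and names exactly the visual evidence needed to justify the answer, mirroring the property the abstract claims for CLEVR-X explanations.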