Paper Title

SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

Paper Authors

Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar

Paper Abstract

Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing the existence, properties, and spatial relationships of entities, a significant portion pose challenges that correspond to reasoning tasks: tasks that can only be answered through a synthesis of perception with knowledge about the world, logic, and/or reasoning. Analyzing performance across this distinction allows us to notice when existing VQA models have consistency issues: they answer the reasoning question correctly but fail on associated low-level perception questions. For example, in Figure 1, the model answers the complex reasoning question "Is the banana ripe enough to eat?" correctly, but fails on the associated perception question "Are the bananas mostly green or yellow?", indicating that the model likely answered the reasoning question correctly but for the wrong reason. We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting VQA-Introspect, a new dataset consisting of 238K new perception questions that serve as sub-questions corresponding to the set of perceptual tasks needed to answer the complex reasoning questions in the Reasoning split. Our evaluation shows that state-of-the-art VQA models have comparable performance on perception and reasoning questions but suffer from consistency problems. To address this shortcoming, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the model to attend to the same parts of the image when answering a reasoning question and its perception sub-questions. We show that SQuINT improves model consistency by about 5%, marginally improves performance on the Reasoning questions in VQA, and produces better attention maps.
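The abstract names the mechanism behind SQuINT (shared attention across a reasoning question and its perception sub-question) without spelling out a training objective. The sketch below, assuming a region-attention VQA model, illustrates one way such an attention-alignment term could be combined with the usual answer losses; the function, its arguments, and the MSE alignment term are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def squint_style_loss(att_reasoning, att_sub, logits_reasoning, logits_sub,
                      target_reasoning, target_sub, lam=1.0):
    """Hypothetical SQuINT-style objective (not the paper's exact loss).

    att_*:    (batch, num_regions) attention weights over image regions
    logits_*: (batch, num_answers) answer logits
    target_*: (batch,) ground-truth answer indices
    lam:      weight of the attention-alignment term
    """
    # Standard VQA answer losses for the reasoning question and its
    # perception sub-question.
    vqa_loss = (F.cross_entropy(logits_reasoning, target_reasoning)
                + F.cross_entropy(logits_sub, target_sub))
    # Alignment term: penalize disagreement between the attention map used
    # for the reasoning question and the one used for the sub-question
    # (mean squared error here; an illustrative choice).
    align_loss = F.mse_loss(att_reasoning, att_sub)
    return vqa_loss + lam * align_loss
```

Tying the two attention maps together pushes the model to ground its reasoning answer in the same visual evidence the perception sub-question requires, which is the behavior the reported ~5% consistency gain measures.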
