NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

Paper title

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

Authors

Fawaz Sammani, Tanmoy Mukherjee, Nikos Deligiannis

Abstract

Natural language explanation (NLE) models aim to explain the decision-making process of a black-box system by generating natural language sentences that are human-friendly, high-level and fine-grained. Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a. the task model), e.g., a VQA model, via a language model (a.k.a. the explanation model), e.g., GPT. Besides the additional memory resources and inference time required by the task model, the task and explanation models are completely independent, which dissociates the explanation from the reasoning process used to predict the answer. We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it. We first pre-train on large-scale image-caption data for general understanding of images, and then formulate the answer as a text prediction task along with the explanation. Without region proposals or a task model, our resulting overall framework attains better evaluation scores, contains far fewer parameters and is 15$\times$ faster than the current state-of-the-art (SoA) model. We then address the problem of evaluating explanations, which are often generic, data-biased and can come in several forms. We therefore design two new evaluation measures: (1) explain-predict and (2) retrieval-based attack, a self-evaluation framework that requires no labels. Code is available at https://github.com/fawazsammani/nlxgpt.
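
The idea of formulating the answer as a text prediction task along with the explanation can be illustrated with a minimal sketch. The template markers (`the answer is`, `because`), the function names and the example strings below are assumptions chosen for illustration; the abstract does not specify them, and the repository linked above remains the authoritative reference for the actual format.

```python
from typing import Tuple

# Hypothetical template markers (assumed for illustration, not taken from the abstract).
ANSWER_MARKER = " the answer is "
EXPLANATION_MARKER = " because "


def build_sequence(question: str, answer: str, explanation: str) -> str:
    """Join question, answer and explanation into one text sequence.

    A single language model trained on such sequences would emit the answer
    and its explanation as one continuation of the question.
    """
    return (
        question.strip()
        + ANSWER_MARKER
        + answer.strip()
        + EXPLANATION_MARKER
        + explanation.strip()
    )


def parse_sequence(text: str) -> Tuple[str, str]:
    """Recover (answer, explanation) from a sequence built with the template above."""
    # Keep only the text after the first answer marker.
    _, _, tail = text.partition(ANSWER_MARKER)
    # Split at the first explanation marker; later occurrences stay in the explanation.
    answer, _, explanation = tail.partition(EXPLANATION_MARKER)
    return answer.strip(), explanation.strip()


if __name__ == "__main__":
    seq = build_sequence(
        "what is the man doing?",
        "surfing",
        "he is riding a wave on a board",
    )
    print(seq)
    print(parse_sequence(seq))
```

Because the answer and the explanation come from the same generated sequence in this sketch, no separate task model is needed at inference time, which mirrors the motivation stated in the abstract.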
