Paper Title
VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge
Paper Authors
Paper Abstract
There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed analysis, we explain which questions benefit, and which don't, from contextualized commonsense knowledge from COMET.
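As a rough illustration of the pipeline the abstract describes (generate candidate commonsense inferences with COMET, select the most relevant ones, and encode them alongside the question), here is a minimal Python sketch. It is not the paper's implementation: `generate_comet_inferences` is a placeholder standing in for a real COMET model, `embed` is a toy stand-in for a sentence encoder, and the relation names are illustrative.

```python
import numpy as np

# Hypothetical stand-in for COMET inference generation. A real system would
# query a trained COMET model with commonsense relations for the question.
def generate_comet_inferences(question: str, relations: list[str]) -> list[str]:
    return [f"{question} [{rel}] <inference placeholder>" for rel in relations]

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy deterministic "embedding" (hash-seeded random vector), used here
    # only so the selection step is runnable; a real system would use a
    # sentence encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def select_top_k(question: str, candidates: list[str], k: int = 3) -> list[str]:
    # Rank candidate inferences by cosine similarity to the question
    # embedding and keep the k most relevant ones.
    q = embed(question)
    ranked = sorted(candidates, key=lambda c: float(q @ embed(c)), reverse=True)
    return ranked[:k]

question = "Why is the man holding an umbrella?"
candidates = generate_comet_inferences(question, ["xIntent", "xNeed", "xEffect"])
selected = select_top_k(question, candidates, k=2)

# In the model described by the abstract, the selected knowledge would be
# encoded together with visual features and question tokens as transformer
# input; here we simply concatenate text to show the idea.
model_input = question + " [SEP] " + " [SEP] ".join(selected)
print(model_input)
```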