Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Paper Title

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Authors

Qian Yang, Yunxin Li, Baotian Hu, Lin Ma, Yuxing Ding, Min Zhang

Abstract

Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence to explain the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image, yet ignore the high-level semantic alignment between phrases (chunks) and visual contents, which is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference. Thus, the generated explanations are less faithful to vision-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (CSI), a relation inferrer, and a Lexical Constraint-aware Generator (LeCG). Specifically, CSI exploits the sentence structure inherent in language and various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG utilizes lexical constraints to expressly incorporate the words or chunks that the relation inferrer focuses on into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and the experimental results indicate that CALeC significantly outperforms other competitor models on inference accuracy and quality of generated explanations.
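The abstract does not give implementation details, so the following is only a toy sketch of the general idea behind passing "focused" chunks to a generator as lexical constraints. It is not the authors' code: the function names, the mean-pooling of token scores into chunk scores, the softmax weighting, and the example scores are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_scores(token_scores, chunks):
    """Mean-pool token-level scores inside each chunk (span = [start, end))."""
    return [sum(token_scores[s:e]) / (e - s) for s, e in chunks]


def select_constraints(tokens, token_scores, chunks, k=1):
    """Return the k highest-weighted chunks as lexical constraints (hypothetical helper)."""
    weights = softmax(chunk_scores(token_scores, chunks))
    ranked = sorted(range(len(chunks)), key=lambda i: weights[i], reverse=True)
    return [" ".join(tokens[chunks[i][0]:chunks[i][1]]) for i in ranked[:k]]


# Made-up hypothesis sentence, relevance scores, and chunk spans.
tokens = ["a", "man", "is", "riding", "a", "horse"]
token_scores = [0.1, 0.9, 0.0, 1.5, 0.2, 2.0]
chunks = [(0, 2), (2, 4), (4, 6)]

constraints = select_constraints(tokens, token_scores, chunks, k=2)
print(constraints)
```

In a real system the scores would come from the reasoning network's attention, and the selected chunks would be enforced during decoding rather than simply printed.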
