Paper Title

AlignVE: Visual Entailment Recognition Based on Alignment Relations

Authors

Biwei Cao, Jiuxin Cao, Jie Gui, Jiayun Shen, Bo Liu, Lei He, Yuan Yan Tang, James Tin-Yau Kwok

Abstract

Visual entailment (VE) is the task of recognizing whether the semantics of a hypothesis text can be inferred from a given premise image; it is a distinctive task among recently emerged vision-and-language understanding tasks. Currently, most existing VE approaches are derived from visual question answering methods. They recognize visual entailment by quantifying the similarity between the content semantic features of the hypothesis and the premise across modalities. Such approaches, however, ignore VE's unique nature of relation inference between the premise and the hypothesis. Therefore, in this paper, a new architecture called AlignVE is proposed to solve the visual entailment problem with a relation interaction method. It models the relation between the premise and the hypothesis as an alignment matrix. It then applies a pooling operation to obtain feature vectors of a fixed size. Finally, these vectors pass through a fully-connected layer and a normalization layer to complete the classification. Experiments show that our alignment-based architecture reaches 72.45% accuracy on the SNLI-VE dataset, outperforming previous content-based models under the same settings.
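As a rough illustration of the pipeline the abstract describes (an alignment matrix relating premise-image region features to hypothesis-token features, pooling to a fixed-size vector, then a fully-connected classifier), here is a minimal PyTorch sketch. The dot-product similarity, the adaptive max-pooling, the layer ordering, and all dimensions below are assumptions chosen for illustration, not the paper's actual AlignVE implementation.

import torch
import torch.nn as nn

class AlignVESketch(nn.Module):
    """Hypothetical sketch of an alignment-matrix VE classifier.

    Assumed details (not taken from the paper): dot-product similarity
    for the alignment matrix, adaptive max-pooling to a fixed grid so
    region/token counts may vary, one fully-connected layer with
    LayerNorm, and all dimensions below.
    """

    def __init__(self, img_dim=2048, txt_dim=768, common_dim=512,
                 pooled=8, num_classes=3):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)      # premise region features
        self.txt_proj = nn.Linear(txt_dim, common_dim)      # hypothesis token features
        self.pool = nn.AdaptiveMaxPool2d((pooled, pooled))  # fixed-size output grid
        self.norm = nn.LayerNorm(pooled * pooled)
        self.fc = nn.Linear(pooled * pooled, num_classes)

    def forward(self, img_feats, txt_feats):
        # img_feats: (batch, num_regions, img_dim)
        # txt_feats: (batch, num_tokens, txt_dim)
        v = self.img_proj(img_feats)               # (B, R, D)
        t = self.txt_proj(txt_feats)               # (B, T, D)
        # Alignment matrix: pairwise similarity between regions and tokens.
        align = torch.bmm(v, t.transpose(1, 2))    # (B, R, T)
        # Pool the matrix to a fixed size regardless of R and T.
        feat = self.pool(align.unsqueeze(1)).flatten(1)  # (B, pooled*pooled)
        # Normalize, then classify: entailment / neutral / contradiction.
        return self.fc(self.norm(feat))            # (B, num_classes)

A quick shape check under these assumptions: with 36 detected regions and 20 hypothesis tokens, model(torch.randn(2, 36, 2048), torch.randn(2, 20, 768)) returns logits of shape (2, 3) for the three VE labels.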
