Paper Title

Cross-Modality Relevance for Reasoning on Language and Vision

Authors

Chen Zheng, Quan Guo, Parisa Kordjamshidi

Abstract

This work deals with the challenge of learning and reasoning over language and vision data for the related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task, which is more generalizable to unobserved data compared to merely reshaping the original representation space. In addition to modeling the relevance between the textual entities and visual entities, we model the higher-order relevance between entity relations in the text and object relations in the image. Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results. The learned alignments of input spaces and their relevance representations by NLVR task boost the training efficiency of VQA task.
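The abstract describes two levels of relevance: a first-order relevance between textual entities and visual objects, and a higher-order relevance between entity relations in the text and object relations in the image. The PyTorch sketch below only illustrates that idea and is not the paper's implementation: the module name `CrossModalityRelevanceSketch`, the shared-space projection with scaled dot-product affinity, and the outer-product construction of the relation-level scores are assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn


class CrossModalityRelevanceSketch(nn.Module):
    """Illustrative sketch of a cross-modality relevance computation.

    The affinity form (scaled dot-product between projected text tokens and
    visual objects) and the pairwise "higher-order" relevance over relation
    pairs are assumptions for illustration, not the paper's exact formulation.
    """

    def __init__(self, text_dim: int, vis_dim: int, rel_dim: int = 256):
        super().__init__()
        # Project both modalities into a shared space before scoring.
        self.text_proj = nn.Linear(text_dim, rel_dim)
        self.vis_proj = nn.Linear(vis_dim, rel_dim)

    def forward(self, text_feats: torch.Tensor, vis_feats: torch.Tensor):
        """
        text_feats: (batch, num_tokens, text_dim)  textual entity features
        vis_feats:  (batch, num_objects, vis_dim)  visual object features
        Returns:
            entity_rel:   (batch, num_tokens, num_objects)
            relation_rel: (batch, num_tokens**2, num_objects**2)
        """
        t = self.text_proj(text_feats)   # (B, T, D)
        v = self.vis_proj(vis_feats)     # (B, O, D)

        # First-order relevance: scaled dot-product between every textual
        # entity and every visual object.
        entity_rel = torch.matmul(t, v.transpose(1, 2)) / t.size(-1) ** 0.5

        # Higher-order relevance: relevance between pairs of textual entities
        # (relations in the text) and pairs of visual objects (relations in
        # the image), formed here as an outer product of first-order scores.
        B, T, O = entity_rel.shape
        relation_rel = torch.einsum("bto,bsp->btsop", entity_rel, entity_rel)
        relation_rel = relation_rel.reshape(B, T * T, O * O)

        return entity_rel, relation_rel


# Example usage with random features for 12 tokens and 36 detected objects.
model = CrossModalityRelevanceSketch(text_dim=768, vis_dim=2048)
text = torch.randn(2, 12, 768)
vision = torch.randn(2, 36, 2048)
entity_rel, relation_rel = model(text, vision)
print(entity_rel.shape, relation_rel.shape)  # (2, 12, 36) (2, 144, 1296)
```

In the paper's framework these relevance representations are learned end-to-end under the supervision of the target task (VQA or NLVR); the sketch stops at producing the two relevance tensors.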
