Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering

Paper Title

Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering

Authors

Zenan Xu, Wanjun Zhong, Qinliang Su, Zijing Ou, Fuwei Zhang

Abstract

A key challenge in video question answering is how to realize the cross-modal semantic alignment between textual concepts and corresponding visual objects. Existing methods mostly seek to align the word representations with the video regions. However, word representations are often not able to convey a complete description of textual concepts, which are in general described by the compositions of certain words. To address this issue, we propose to first build a syntactic dependency tree for each question with an off-the-shelf tool and use it to guide the extraction of meaningful word compositions. Based on the extracted compositions, a hypergraph is further built by viewing the words as nodes and the compositions as hyperedges. Hypergraph convolutional networks (HCN) are then employed to learn the initial representations of word compositions. Afterwards, an optimal transport based method is proposed to perform cross-modal semantic alignment for the textual and visual semantic space. To reflect the cross-modal influences, the cross-modal information is incorporated into the initial representations, leading to a model named cross-modality-aware syntactic HCN. Experimental results on three benchmarks show that our method outperforms all strong baselines. Further analyses demonstrate the effectiveness of each component, and show that our model is good at modeling different levels of semantic compositions and filtering out irrelevant information.
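
To make the pipeline in the abstract more concrete, here is a minimal NumPy sketch of its three steps: turning a dependency tree into a hypergraph, running one hypergraph convolution, and aligning the resulting composition representations with visual features via entropic optimal transport. It is an illustration under stated assumptions, not the authors' implementation. The toy sentence and its hand-written head indices, the rule "one hyperedge per head word plus its descendants", the symmetric degree normalisation, the cosine cost, and the Sinkhorn solver are all choices made here for illustration; the abstract does not specify any of them.

```python
import numpy as np

# Toy question with a hand-written dependency tree (head index per token, -1 = root).
# Assumption: a real pipeline would obtain the heads from an off-the-shelf parser.
tokens = ["what", "is", "the", "man", "in", "red", "shirt", "holding"]
heads = [7, 7, 3, 7, 6, 6, 3, -1]

def subtree_hyperedges(heads, min_size=2):
    """Assumed composition rule: each head word plus all its descendants is one hyperedge."""
    n = len(heads)
    children = {i: [] for i in range(n)}
    for child, head in enumerate(heads):
        if head >= 0:
            children[head].append(child)

    def collect(node):
        members = [node]
        for c in children[node]:
            members.extend(collect(c))
        return members

    edges = []
    for node in range(n):
        members = sorted(collect(node))
        if len(members) >= min_size:  # skip single-word "compositions"
            edges.append(members)
    return edges

def incidence_matrix(n_nodes, edges):
    """H[i, j] = 1 if word i belongs to hyperedge (composition) j."""
    H = np.zeros((n_nodes, len(edges)))
    for j, members in enumerate(edges):
        H[members, j] = 1.0
    return H

def hypergraph_conv(X, H, Theta):
    """One linear hypergraph convolution with symmetric degree normalisation.
    Assumes every node lies in at least one hyperedge (true here via the root subtree)."""
    dv = H.sum(axis=1)  # node degrees
    de = H.sum(axis=0)  # hyperedge sizes
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    De_inv = np.diag(1.0 / de)
    A = Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt
    return A @ X @ Theta

def cosine_cost(A, B, tiny=1e-12):
    """Pairwise cosine distance between rows of A and rows of B."""
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + tiny)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + tiny)
    return 1.0 - A @ B.T

def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropic optimal transport between uniform marginals (Sinkhorn iterations)."""
    m, k = cost.shape
    a = np.full(m, 1.0 / m)
    b = np.full(k, 1.0 / k)
    K = np.exp(-cost / eps)
    u = np.ones(m)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
dim = 8
edges = subtree_hyperedges(heads)
H = incidence_matrix(len(tokens), edges)

X = rng.normal(size=(len(tokens), dim))   # placeholder word embeddings
Theta = rng.normal(size=(dim, dim))       # placeholder layer weights
X_out = hypergraph_conv(X, H, Theta)

# Composition (hyperedge) representation: mean of its member word features.
E = (H.T @ X_out) / H.sum(axis=0)[:, None]

V = rng.normal(size=(4, dim))             # placeholder visual region features
plan = sinkhorn(cosine_cost(E, V))        # soft composition-to-region alignment

for members in edges:
    print(" ".join(tokens[i] for i in members))
print(plan.shape)
```

Each row of `plan` spreads the mass of one word composition over the visual regions, so a large entry marks a composition and a region that the cost function considers close. In this sketch the embeddings, weights, and visual features are random placeholders; in a trained model they would come from the text and video encoders.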
