Hierarchical Conditional Relation Networks for Video Question Answering

Paper Title

Hierarchical Conditional Relation Networks for Video Question Answering

Authors

Le, Thao Minh; Le, Vuong; Venkatesh, Svetha; Tran, Truyen

Abstract

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts. We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN) that serves as a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning. The resulting architecture for VideoQA is a CRN hierarchy whose branches represent sub-videos or clips, all sharing the same question as the contextual condition. Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
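
The unit described in the abstract maps an array of objects plus a conditioning feature to an array of encoded outputs, and the full model stacks such units over clips and then over the whole video, all conditioned on the same question. The pure-Python sketch below illustrates only that input/output contract and the stacking idea. Everything inside it is an assumption made for illustration: the names `crn_unit` and `hierarchy` are hypothetical, the subset mean stands in for a learned relation function, and the element-wise product stands in for a learned conditioning function. It is not the authors' implementation.

```python
from itertools import combinations


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors (lists of floats)."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def crn_unit(objects, condition, max_subset_size):
    """Toy CRN-like unit: array of objects + condition -> array of outputs.

    Assumed internals (illustrative only): for each subset size k, every
    k-subset of the objects is summarized by its mean, modulated
    element-wise by the condition, and the results are averaged.
    """
    if len(objects) < 2:
        raise ValueError("need at least two objects to form a relation")
    largest = min(max_subset_size, len(objects))
    outputs = []
    for k in range(2, largest + 1):
        conditioned = []
        for subset in combinations(objects, k):
            relation = mean_vectors(list(subset))
            conditioned.append([r * c for r, c in zip(relation, condition)])
        outputs.append(mean_vectors(conditioned))
    return outputs


def hierarchy(clips, question, max_subset_size=3):
    """Stack units: one per clip, then one over the clip summaries.

    The same question vector conditions every level.
    """
    clip_summaries = []
    for frames in clips:
        clip_outputs = crn_unit(frames, question, max_subset_size)
        clip_summaries.append(mean_vectors(clip_outputs))
    return crn_unit(clip_summaries, question, max_subset_size)


# Hypothetical toy data: three 2-dimensional "frame" vectors per clip.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
question = [1.0, 0.5]
unit_out = crn_unit(frames, question, 3)
video_out = hierarchy([frames, frames], question)
```

Because the output of one unit is again an array of vectors, it can be fed directly into another unit, which is what makes replication and stacking straightforward in this sketch.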
