Paper Title
Hierarchical Conditional Relation Networks for Video Question Answering
Authors
Abstract
Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts. We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN) that serves as a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning. The resulting architecture for VideoQA is a CRN hierarchy whose branches represent sub-videos or clips, all sharing the same question as the contextual condition. Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
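To make the unit's interface concrete, here is a minimal sketch in plain Python of a block that takes an array of objects plus a conditioning feature and returns an array of encoded objects, then stacks two such blocks into a clip-to-video hierarchy. The abstract does not specify the unit's internals, so everything inside is an assumption for illustration: the subset sampling, the mean pooling, the single tanh layer, the shared weights, and the names `crn`, `condition`, `mean_pool`, `k_max`, and `num_subsets` are all hypothetical and should not be read as the authors' implementation.

```python
import itertools
import math
import random


def mean_pool(vectors):
    """Element-wise mean of a sequence of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def condition(pooled, cond, weights):
    """Assumed conditioning step: tanh(W @ [pooled; cond])."""
    joint = pooled + cond  # list concatenation
    return [math.tanh(sum(w * x for w, x in zip(row, joint)))
            for row in weights]


def crn(objects, cond, k_max, num_subsets, weights, rng):
    """Hypothetical conditional relation unit.

    For each subset size k in 2..k_max, sample up to num_subsets subsets
    of the input objects, pool each subset, condition it on `cond`, and
    average the results. Returns one output object per subset size.
    """
    outputs = []
    for k in range(2, k_max + 1):
        all_subsets = list(itertools.combinations(objects, k))
        chosen = rng.sample(all_subsets, min(num_subsets, len(all_subsets)))
        conditioned = [condition(mean_pool(s), cond, weights) for s in chosen]
        outputs.append(mean_pool(conditioned))
    return outputs


# Toy data: 3 clips of 4 frames each, every feature of size `dim`.
rng = random.Random(0)
dim = 3
question = [0.5, -0.2, 0.1]  # stand-in for an encoded question
# Weights are shared across levels here purely to keep the sketch short.
weights = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
clips = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
         for _ in range(3)]

# Clip level: one unit per clip, all conditioned on the same question.
clip_vectors = [mean_pool(crn(frames, question, 3, 2, weights, rng))
                for frames in clips]

# Video level: the same unit stacked on top of the clip-level outputs.
video_objects = crn(clip_vectors, question, 3, 2, weights, rng)
```

In this sketch, the hierarchy is built only by reusing the same function at two levels with the question as the shared condition, which mirrors the replicate-and-stack idea described in the abstract. A real model would use learned, level-specific parameters and tensor operations rather than Python lists.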