Paper Title

Video Event Extraction via Tracking Visual States of Arguments

Authors

Guang Yang, Manling Li, Jiajie Zhang, Xudong Lin, Shih-Fu Chang, Heng Ji

Abstract

Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles. Existing methods focus on capturing the overall visual scene of each frame, ignoring fine-grained argument-level information. Inspired by the definition of events as changes of states, we propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments, which are expected to provide the most informative evidence for the extraction of video events. In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments. We further propose Object State Embedding, Object Motion-aware Embedding and Argument Interaction Embedding to encode and track these changes respectively. Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models. In particular, on verb classification, we achieve 3.49% absolute gains (19.53% relative gains) in F1@5 on Video Situation Recognition.
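
The abstract names three components: Object State Embedding (pixel-level changes within an object), Object Motion-aware Embedding (object displacement), and Argument Interaction Embedding (interactions among multiple arguments). Below is a minimal PyTorch sketch of how such a decomposition could be wired together; the module internals, tensor shapes, and the additive fusion step are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the three embeddings named in the abstract.
# Shapes, module internals, and the fusion step are assumptions.
import torch
import torch.nn as nn


class ObjectStateEmbedding(nn.Module):
    """Tracks appearance (pixel) changes inside each object region across frames."""

    def __init__(self, region_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden_dim)
        # Assumed temporal encoder over per-frame region features.
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_args, num_frames, region_dim) pooled object-crop features
        out, _ = self.temporal(self.proj(region_feats))
        return out  # (num_args, num_frames, hidden_dim)


class ObjectMotionAwareEmbedding(nn.Module):
    """Encodes frame-to-frame displacement of each object's bounding box."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_args, num_frames, 4) normalized (x, y, w, h)
        zeros = torch.zeros_like(boxes[:, :1])  # no displacement before frame 0
        deltas = torch.cat([zeros, boxes[:, 1:] - boxes[:, :-1]], dim=1)
        return self.mlp(deltas)  # (num_args, num_frames, hidden_dim)


class ArgumentInteractionEmbedding(nn.Module):
    """Models interactions among all arguments within each frame via self-attention."""

    def __init__(self, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, arg_feats: torch.Tensor) -> torch.Tensor:
        # arg_feats: (num_frames, num_args, hidden_dim); arguments attend to each other
        out, _ = self.attn(arg_feats, arg_feats, arg_feats)
        return out


if __name__ == "__main__":
    num_args, num_frames = 3, 8
    region_feats = torch.randn(num_args, num_frames, 2048)  # e.g. pooled CNN features
    boxes = torch.rand(num_args, num_frames, 4)

    state = ObjectStateEmbedding(2048, 256)(region_feats)  # pixel changes
    motion = ObjectMotionAwareEmbedding(256)(boxes)        # displacements
    fused = (state + motion).transpose(0, 1)               # additive fusion is an assumption
    interaction = ArgumentInteractionEmbedding(256)(fused) # cross-argument interactions
    print(interaction.shape)                               # torch.Size([8, 3, 256])
```

The resulting per-argument, per-frame representations would then feed downstream verb and semantic-role classifiers; that classification head is omitted here.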
