Paper Title

Co-attentional Transformers for Story-Based Video Understanding

Authors

Björn Bebensee, Byoung-Tak Zhang

Abstract

Inspired by recent trends in vision and language learning, we explore applications of attention mechanisms for visio-lingual fusion within an application to story-based video understanding. Like other video-based QA tasks, video story understanding requires agents to grasp complex temporal dependencies. However, as it focuses on the narrative aspect of video, it also requires understanding of the interactions between different characters, as well as their actions and their motivations. We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas, and measure its performance on the video question answering task. We evaluate our approach on the recently introduced DramaQA dataset, which features character-centered video story understanding questions. Our model outperforms the baseline model by 8 percentage points overall, at least 4.95 and up to 12.8 percentage points on all difficulty levels, and manages to beat the winner of the DramaQA challenge.
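The co-attentional exchange at the heart of such models pairs two cross-attention streams: each modality forms queries over the other modality's keys and values. The sketch below is illustrative only, a minimal NumPy rendering of that idea under assumed toy dimensions, not the authors' actual architecture (which stacks such layers inside transformer blocks with multi-head attention, feed-forward sublayers, and normalization):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: queries from one modality,
    # keys/values from the other
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

def co_attention(visual, textual):
    # one co-attentional exchange: each modality's features are
    # updated by attending over the other modality's features
    new_visual = cross_attention(visual, textual, textual)
    new_textual = cross_attention(textual, visual, visual)
    return new_visual, new_textual

# toy example: 4 visual tokens and 6 text tokens, hidden dim 8
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
t = rng.normal(size=(6, 8))
nv, nt = co_attention(v, t)
print(nv.shape, nt.shape)  # (4, 8) (6, 8)
```

Note that each output keeps its own modality's sequence length while mixing in information from the other stream, which is what lets the model relate characters seen on screen to mentions of them in the question and subtitles.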
