Paper Title

Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos

Authors

Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang

Abstract

Automatically generating sentences to describe events and temporally localizing sentences in a video are two important tasks that bridge language and videos. Recent techniques leverage the multimodal nature of videos by using off-the-shelf features to represent videos, but interactions between modalities are rarely explored. Inspired by the fact that there exist cross-modal interactions in the human brain, we propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos and thus improve performance on both tasks. We model modality interactions at both the sequence and channel levels in a pairwise fashion, and the pairwise interaction also provides some explainability for the predictions of target tasks. We demonstrate the effectiveness of our method and validate specific design choices through extensive ablation studies. Our method achieves state-of-the-art performance on four standard benchmark datasets: MSVD and MSR-VTT (event captioning task), and Charades-STA and ActivityNet Captions (temporal sentence localization task).
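
To make the idea of pairwise modality interaction at the sequence and channel levels more concrete, below is a minimal PyTorch sketch. The module names, the dot-product cross-attention, and the sigmoid channel gate are illustrative assumptions, not the authors' published architecture; they only show how two modality feature streams (e.g., appearance and motion features) could interact per time step (sequence level) and per feature channel (channel level).

```python
# Minimal sketch of pairwise modality interaction (assumed design, for illustration only).
import torch
import torch.nn as nn


class PairwiseModalityInteraction(nn.Module):
    """Fuses two modality streams, e.g., appearance and motion features."""

    def __init__(self, dim_a, dim_b, dim_hidden):
        super().__init__()
        # Project both modalities into a shared hidden space.
        self.proj_a = nn.Linear(dim_a, dim_hidden)
        self.proj_b = nn.Linear(dim_b, dim_hidden)
        # Channel-level interaction: a per-channel gate conditioned on both modalities.
        self.channel_gate = nn.Linear(2 * dim_hidden, dim_hidden)

    def forward(self, feats_a, feats_b):
        # feats_a: (batch, T_a, dim_a), feats_b: (batch, T_b, dim_b)
        a = self.proj_a(feats_a)                       # (batch, T_a, H)
        b = self.proj_b(feats_b)                       # (batch, T_b, H)

        # Sequence-level interaction: each step of modality A attends over
        # the steps of modality B (simple scaled dot-product cross-attention).
        attn = torch.softmax(a @ b.transpose(1, 2) / a.size(-1) ** 0.5, dim=-1)
        b_for_a = attn @ b                             # (batch, T_a, H)

        # Channel-level interaction: gate the attended features channel-wise.
        gate = torch.sigmoid(self.channel_gate(torch.cat([a, b_for_a], dim=-1)))
        fused = a + gate * b_for_a                     # (batch, T_a, H)
        return fused


if __name__ == "__main__":
    # Toy usage: 2 clips, 8 appearance steps (dim 1024), 6 motion steps (dim 512).
    module = PairwiseModalityInteraction(dim_a=1024, dim_b=512, dim_hidden=256)
    out = module(torch.randn(2, 8, 1024), torch.randn(2, 6, 512))
    print(out.shape)  # torch.Size([2, 8, 256])
```

In a multi-modality setting, one such module would be instantiated for each pair of modalities, and the fused outputs fed to the captioning or sentence-localization heads; the pairwise attention maps are what lend the predictions some explainability.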
