Paper Title
Text-based Localization of Moments in a Video Corpus
Paper Authors
Paper Abstract
Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment only within that video. Different from such works, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge, as the system is required to perform: (i) retrieval of the relevant video, where only a segment of the video corresponds to the queried sentence, and (ii) temporal localization of the moment in the relevant video based on the sentence query. To overcome this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets (Charades-STA, DiDeMo, and ActivityNet Captions) demonstrate that our method achieves promising performance on the proposed task of temporal localization of moments in a corpus of videos.
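
To make the joint moment-sentence embedding idea concrete, the following is a minimal PyTorch sketch, not the authors' HMAN implementation: it projects candidate moment features and sentence features into a shared space and trains with two triplet ranking terms, one over intra-video negatives (other moments from the ground-truth video) and one over inter-video negatives (moments from other videos in the corpus). The module names, feature dimensions, margin, and exact loss form are illustrative assumptions rather than details taken from the paper.

# Minimal sketch (assumed structure, not the published HMAN code): a joint
# moment-sentence embedding with separate intra-video and inter-video
# triplet ranking losses. Dimensions and margin are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, moment_dim=500, sent_dim=300, embed_dim=256):
        super().__init__()
        self.moment_proj = nn.Sequential(
            nn.Linear(moment_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.sent_proj = nn.Sequential(
            nn.Linear(sent_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, moment_feats, sent_feats):
        # L2-normalize so a dot product equals cosine similarity.
        m = F.normalize(self.moment_proj(moment_feats), dim=-1)
        s = F.normalize(self.sent_proj(sent_feats), dim=-1)
        return m, s

def ranking_losses(m, s, video_ids, pos_idx, margin=0.2):
    # m:         (N, D) embedded candidate moments from all videos in the batch
    # s:         (B, D) embedded sentence queries
    # video_ids: (N,)   video index of each candidate moment
    # pos_idx:   (B,)   index into m of the ground-truth moment for each query
    sim = s @ m.t()                                   # (B, N) cosine similarities
    pos_sim = sim.gather(1, pos_idx.unsqueeze(1))     # (B, 1) positive-pair similarity
    same_video = video_ids.unsqueeze(0) == video_ids[pos_idx].unsqueeze(1)  # (B, N)
    is_pos = torch.zeros_like(sim, dtype=torch.bool)
    is_pos[torch.arange(s.size(0)), pos_idx] = True

    hinge = (margin + sim - pos_sim).clamp(min=0)
    # Intra-video term: separate the correct moment from other moments of the same video.
    intra = hinge[same_video & ~is_pos].mean()
    # Inter-video term: separate the correct moment from moments of other corpus videos.
    inter = hinge[~same_video].mean()
    return intra, inter

if __name__ == "__main__":
    model = JointEmbedding()
    moments = torch.randn(12, 500)                    # e.g. 4 videos x 3 candidate moments
    sents = torch.randn(4, 300)                       # one sentence query per video
    video_ids = torch.arange(4).repeat_interleave(3)  # [0,0,0,1,1,1,2,2,2,3,3,3]
    pos_idx = torch.tensor([0, 4, 8, 11])             # ground-truth moment per query
    m, s = model(moments, sents)
    intra, inter = ranking_losses(m, s, video_ids, pos_idx)
    (intra + inter).backward()

In this sketch the inter-video term plays the role of distinguishing global semantic concepts across the corpus, while the intra-video term captures subtle differences between moments of the same video; how HMAN actually encodes moments hierarchically and weights these objectives is described in the paper itself.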