Paper Title

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Authors

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

Abstract

We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic. The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal window. The queries are also labeled with query types that indicate whether each of them is more related to video, subtitle, or both, allowing for in-depth analysis of the dataset and of methods built on top of it. Strict qualification and post-annotation verification tests are applied to ensure the quality of the collected data. Further, we present several baselines and a novel Cross-modal Moment Localization (XML) network for the multimodal moment retrieval task. The proposed XML model uses a late fusion design with a novel Convolutional Start-End detector (ConvSE), surpassing baselines by a large margin and with better efficiency, providing a strong starting point for future work. We have also collected additional descriptions for each annotated moment in TVR to form a new multimodal captioning dataset with 262K captions, named TV show Caption (TVC). Both datasets are publicly available. TVR: https://tvr.cs.unc.edu, TVC: https://tvr.cs.unc.edu/tvc.html.
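The abstract's core idea behind ConvSE is detecting moment boundaries as "edges" in a 1D query-clip similarity sequence via convolution. The sketch below is a minimal illustration of that idea, not the paper's model: XML learns its convolutional filters end-to-end, whereas here fixed `[-1, 1]`-style edge kernels and the `max_len` span cap are hypothetical stand-ins.

```python
import numpy as np

def convse_span(sim, max_len=16):
    """Toy ConvSE-style start-end detector (illustrative only).

    sim: 1D array of query-clip similarity scores, one per video clip.
    Rising edges suggest a moment's start, falling edges its end; the
    highest-scoring (start, end) pair with start <= end is returned.
    """
    n = len(sim)
    # Fixed edge-detecting kernels; the actual XML model learns these.
    start = np.convolve(sim, [1.0, -1.0])[:n]   # sim[i] - sim[i-1] (rising edge)
    end = np.convolve(sim, [-1.0, 1.0])[1:]     # sim[j] - sim[j+1] (falling edge)
    best, span = -np.inf, (0, 0)
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            if start[i] + end[j] > best:
                best, span = start[i] + end[j], (i, j)
    return span

# Similarity peaks over clips 3..6, so those boundaries are recovered.
sim = np.array([0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1])
print(convse_span(sim))  # -> (3, 6)
```

The convolutional detector avoids scoring every span with a learned head: boundary scores are computed once per position, and span scores decompose as a sum of start and end scores, which is what makes the late-fusion design efficient.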
