Paper Title

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Paper Authors

Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

Paper Abstract

Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, and thus fail to fully exploit the unique characteristic of video, i.e., temporality. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling the cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which yields detailed video moment representations. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions with the multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with 8.6% and 11.1% improvements, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
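
As a rough illustration of the shuffling test mentioned in the abstract, the sketch below scores a video-text pair twice: once with the frames in their original order and once with the order randomly permuted. A small gap between the two scores suggests weak temporal reliance; a large gap suggests strong reliance on temporal order. This is a minimal sketch under stated assumptions, not the paper's implementation: `model.score` stands in for a hypothetical cross-modal matching function of any video-language model, and `video_frames` is any ordered frame sequence.

```python
import random

def shuffling_test(model, video_frames, text, seed=0):
    """Compare matching scores with original vs. shuffled frame order.

    `model.score(frames, text)` is a hypothetical similarity function;
    substitute the scoring head of the video-language model under test.
    """
    original = model.score(video_frames, text)

    shuffled_frames = list(video_frames)
    random.Random(seed).shuffle(shuffled_frames)  # destroy temporal order
    shuffled = model.score(shuffled_frames, text)

    # A gap near zero suggests the model (or dataset) barely depends on
    # temporal order; a large positive gap suggests strong dependence.
    return original, shuffled, original - shuffled
```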
