CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Paper Title

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Authors

Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

Abstract

Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a large impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves state-of-the-art (SOTA) results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
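
For readers unfamiliar with the video-text retrieval setting the abstract refers to, the following is a minimal sketch of a CLIP-style symmetric contrastive objective and a Recall@K retrieval metric. It is an illustration only and not the released CLIP-ViP code: the abstract does not specify the training loss, the Omnisource Cross-modal Learning and Video Proxy components are not modeled here, and all function names, the toy embeddings, and the temperature of 0.07 are assumptions chosen for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(video_embs, text_embs):
    """Cosine similarity between every video (rows) and every text (columns)."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of the video-to-text and text-to-video contrastive losses.

    The temperature value is an assumption for this sketch.
    """
    sim = similarity_matrix(video_embs, text_embs)
    logits_v2t = [[s / temperature for s in row] for row in sim]
    logits_t2v = [list(col) for col in zip(*logits_v2t)]  # transpose
    return 0.5 * (cross_entropy_diag(logits_v2t) + cross_entropy_diag(logits_t2v))

def recall_at_k(sim, k):
    """Fraction of queries (rows) whose matching item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy, hand-made embeddings (hypothetical): pair i of videos and texts should match.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.1], [0.1, 1.0]]

aligned_loss = symmetric_contrastive_loss(videos, texts)
shuffled_loss = symmetric_contrastive_loss(videos, texts[::-1])
aligned_recall = recall_at_k(similarity_matrix(videos, texts), k=1)
```

In this toy setup the correctly paired embeddings yield a lower loss than the shuffled pairing, which is the behavior the objective rewards during training. Retrieval benchmarks such as those named in the abstract are typically reported with Recall@K computed over a similarity matrix of this form.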
