Paper Title
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Paper Authors
Paper Abstract
Despite recent progress in video and language representation learning, the weak or sparse correspondence between the two modalities remains a bottleneck in the area. Most video-language models are trained via a pair-level loss to predict whether a video-text pair is aligned. However, even in paired video-text segments, only a subset of the frames is semantically relevant to the corresponding text, with the remainder representing noise; the ratio of noisy frames is higher for longer videos. We propose FineCo (Fine-grained Contrastive Loss for Frame Sampling), an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames. It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence. Building on the well-established VideoCLIP model as a starting point, FineCo achieves state-of-the-art performance on YouCookII, a text-video retrieval benchmark with long videos. FineCo also achieves competitive results on text-video retrieval (MSR-VTT) and video question answering datasets (MSR-VTT QA and MSR-VTT MC) with shorter videos.
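To make the core idea concrete, the fine-grained objective described above can be sketched as an InfoNCE-style loss computed over the frames of a single video: the frame judged most semantically relevant to the text serves as the positive, and the remaining frames act as noisy negatives. This is a minimal illustrative sketch, not the paper's implementation; the function name, the argmax-based frame selection rule, and the embedding shapes are assumptions.

```python
import numpy as np

def frame_level_contrastive_loss(frame_emb, text_emb, temperature=0.07):
    """Illustrative frame-level InfoNCE loss (hypothetical sketch).

    frame_emb: (num_frames, dim) array of frame embeddings.
    text_emb:  (dim,) array, the paired text embedding.
    The frame most similar to the text is treated as the positive;
    all other frames of the same video serve as negatives.
    """
    # Cosine similarity between each frame and the text
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sims = f @ t / temperature                      # (num_frames,)

    # Hypothetical selection rule: positive = most text-similar frame
    pos = sims.argmax()

    # Standard InfoNCE: -log softmax probability of the positive frame
    log_probs = sims - np.log(np.exp(sims - sims.max()).sum()) - sims.max()
    return -log_probs[pos]
```

When one frame aligns closely with the text and the rest do not, the loss approaches zero; when all frames are equally (ir)relevant, it approaches `log(num_frames)`, reflecting the intuition that noisy frames dilute the cross-modal signal.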