Paper Title
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Paper Authors
Paper Abstract
Despite recent progress in video and language representation learning, the weak or sparse correspondence between the two modalities remains a bottleneck in the area. Most video-language models are trained via a pair-level loss to predict whether a video-text pair is aligned. However, even in paired video-text segments, only a subset of the frames is semantically relevant to the corresponding text, with the remainder representing noise; the ratio of noisy frames is higher for longer videos. We propose FineCo (Fine-grained Contrastive Loss for Frame Sampling), an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames. It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence. Building on the well-established VideoCLIP model as a starting point, FineCo achieves state-of-the-art performance on YouCookII, a text-video retrieval benchmark with long videos. FineCo also achieves competitive results on text-video retrieval (MSR-VTT) and video question answering datasets (MSR-VTT QA and MSR-VTT MC) with shorter videos.
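To make the core idea concrete, the fine-grained objective described above can be sketched as an InfoNCE-style loss computed over the frames of a single video: the frame judged most semantically relevant to the text serves as the positive, and the remaining frames act as noisy negatives. This is a minimal illustrative sketch, not the paper's implementation; the function name, the argmax-based frame selection rule, and the embedding shapes are assumptions.

```python
import numpy as np

def frame_level_contrastive_loss(frame_emb, text_emb, temperature=0.07):
    """Illustrative frame-level InfoNCE loss (hypothetical sketch).

    frame_emb: (num_frames, dim) array of frame embeddings.
    text_emb:  (dim,) array, the paired text embedding.
    The frame most similar to the text is treated as the positive;
    all other frames of the same video serve as negatives.
    """
    # Cosine similarity between each frame and the text
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sims = f @ t / temperature                      # (num_frames,)

    # Hypothetical selection rule: positive = most text-similar frame
    pos = sims.argmax()

    # Standard InfoNCE: -log softmax probability of the positive frame
    log_probs = sims - np.log(np.exp(sims - sims.max()).sum()) - sims.max()
    return -log_probs[pos]
```

When one frame aligns closely with the text and the rest do not, the loss approaches zero; when all frames are equally (ir)relevant, it approaches `log(num_frames)`, reflecting the intuition that noisy frames dilute the cross-modal signal.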