Paper Title

VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Authors

Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

Abstract


We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning.
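To make the core idea concrete: instead of adding cross-frame fusion modules, per-frame patch embeddings are flattened into one long token sequence, and CoCa's two attentional poolers (a multi-query generative pooler feeding the caption decoder, and a single-query contrastive pooler producing the video embedding) attend over it directly. Below is a minimal single-head NumPy sketch of this pooling step; the dimensions, query counts, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries, d):
    # learned queries attend over the token sequence (single head, no projections)
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (n_queries, n_tokens)
    return attn @ tokens                             # (n_queries, d)

# toy dimensions (hypothetical): T frames, P patch tokens per frame, embed dim d
T, P, d = 4, 9, 8
rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(T, P, d))   # per-frame image-encoder outputs
flat = frame_embs.reshape(T * P, d)       # flatten across time; no fusion module

gen_queries = rng.normal(size=(16, d))    # generative pooler: multiple queries
con_query = rng.normal(size=(1, d))       # contrastive pooler: a single query

gen_tokens = attentional_pool(flat, gen_queries, d)  # (16, d) -> caption decoder
video_emb = attentional_pool(flat, con_query, d)[0]  # (d,)    -> contrastive loss
print(gen_tokens.shape, video_emb.shape)
```

Because the poolers are permutation-agnostic attention over tokens, the pretrained image-text weights apply to the longer flattened sequence without architectural changes, which is what enables zero-shot transfer with minimal extra training.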
