Paper Title
Self-supervised Contrastive Learning for Audio-Visual Action Recognition
Paper Authors
Paper Abstract
The underlying correlation between audio and visual modalities can be exploited to derive supervisory information for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) to learn discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse the audio and visual modalities. To align the heterogeneous audio-visual modalities, we construct a novel co-correlation-guided representation alignment module (CGRA). To learn supervisory information from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate the superiority of our AVCL over state-of-the-art methods on large-scale action recognition benchmarks.
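For readers unfamiliar with cross-modal contrastive objectives like the SelfCL module described above, the sketch below illustrates the general idea: audio and visual clips from the same video are treated as a positive pair, and all other clips in the batch serve as negatives. This is a minimal symmetric InfoNCE-style loss in PyTorch, written as an illustration only; the function name `av_contrastive_loss`, the temperature value, and the exact formulation are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a symmetric audio-visual contrastive (InfoNCE) loss.
# Not the paper's SelfCL code; the abstract does not specify its exact form.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb: torch.Tensor,
                        visual_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, visual_emb: (batch, dim) embeddings from the two encoders.

    The audio and visual embeddings of the same video (the diagonal of the
    similarity matrix) are positives; all off-diagonal pairs are negatives.
    """
    a = F.normalize(audio_emb, dim=1)   # L2-normalize so dot product = cosine
    v = F.normalize(visual_emb, dim=1)
    logits = a @ v.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # diagonal indices
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random embeddings:
# loss = av_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```

In this style of objective, no action labels are needed: the co-occurrence of sound and image within a video supplies the supervisory signal, which is the premise stated in the abstract.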