Paper Title
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Paper Authors
Paper Abstract
When watching videos, a visual event is often accompanied by an audio event, e.g., the sound of lip motion or the music of a played instrument. There is an underlying correlation between audio and visual events, which can be exploited as free supervision to train a neural network on the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with a co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, which further benefit downstream tasks. Specifically, we explore three different co-attention modules that focus on the discriminative visual regions correlated with the sounds and model the interactions between the two modalities. Experiments show that our model achieves state-of-the-art performance on the pretext task while using fewer parameters than existing methods. To further evaluate the generalizability and transferability of our approach, we apply the pre-trained model to two downstream tasks, i.e., sound source localization and action recognition. Extensive experiments demonstrate that our model achieves results competitive with other self-supervised methods and can handle challenging scenes that contain multiple sound sources.
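For concreteness, the sketch below illustrates one plausible form of such a co-attention module paired with a binary classification head for the audio-visual synchronization pretext task. It is a minimal PyTorch sketch under stated assumptions: the single-head dot-product attention design, the module names (CoAttention, SyncHead), and all feature dimensions are illustrative choices of ours, not the architecture described in the paper.

```python
# Minimal sketch (hypothetical): cross-modal co-attention where each modality
# attends to the other, plus a synchronization head for the pretext task.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttention(nn.Module):
    """Cross-modal attention: visual tokens attend to audio, and vice versa."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query_v = nn.Linear(dim, dim)  # visual -> query
        self.key_a = nn.Linear(dim, dim)    # audio  -> key
        self.value_a = nn.Linear(dim, dim)  # audio  -> value
        self.query_a = nn.Linear(dim, dim)
        self.key_v = nn.Linear(dim, dim)
        self.value_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (B, Nv, D) spatial visual tokens; aud: (B, Na, D) audio tokens.
        # Visual tokens attend to audio tokens ...
        attn_va = torch.softmax(
            self.query_v(vis) @ self.key_a(aud).transpose(1, 2) * self.scale,
            dim=-1)
        vis_att = attn_va @ self.value_a(aud)   # (B, Nv, D)
        # ... and audio tokens attend to visual tokens.
        attn_av = torch.softmax(
            self.query_a(aud) @ self.key_v(vis).transpose(1, 2) * self.scale,
            dim=-1)
        aud_att = attn_av @ self.value_v(vis)   # (B, Na, D)
        return vis + vis_att, aud + aud_att     # residual connections


class SyncHead(nn.Module):
    """Binary classifier: does the audio match the video clip?"""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # Mean-pool the tokens of each modality, fuse, and predict a logit.
        fused = torch.cat([vis.mean(dim=1), aud.mean(dim=1)], dim=-1)
        return self.fc(fused).squeeze(-1)       # logits: (B,)


if __name__ == "__main__":
    B, Nv, Na, D = 4, 49, 16, 512
    vis, aud = torch.randn(B, Nv, D), torch.randn(B, Na, D)
    co_attn, head = CoAttention(D), SyncHead(D)
    v, a = co_attn(vis, aud)
    labels = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 1 = synchronized pair
    loss = F.binary_cross_entropy_with_logits(head(v, a), labels)
    print(loss.item())
```

In this sketch, the attention weights over visual tokens (attn_av) are what a sound source localization readout would inspect, and the residual connections let the co-attention refine rather than replace the per-modality features; both are our design assumptions, not details taken from the paper.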