Paper Title
Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
Paper Authors
Paper Abstract
The task of audio-visual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. In real-world scenarios, however, audio is usually contaminated by off-screen sounds and background noise. Such interference disrupts the process of identifying the desired sources and building visual-sound connections, making previous approaches inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient because audio signals are additive. We therefore extend the audio representation with our Audio-Instance-Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. We then erase the influence of audible but off-screen sounds and of silent but visible objects with a Cross-modal Referrer module that uses cross-modality distillation. Quantitative and qualitative evaluations demonstrate that the proposed framework achieves superior results on sound localization tasks, especially under real-world scenarios. Code is available at https://github.com/alvinliu0/Visual-Sound-Localization-in-the-Wild.
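The additive, unevenly mixed nature of audio that the abstract refers to can be illustrated with a minimal NumPy sketch. This is not the IEr framework or any part of the released code. The sample rate, the two sine-wave "sources", the gains, and the helper name `mix_sources` are all assumptions chosen for illustration.

```python
import numpy as np


def mix_sources(sources, gains):
    """Return the gain-weighted sum of equal-length waveforms (hypothetical helper)."""
    sources = np.asarray(sources, dtype=np.float64)  # shape: (n_sources, n_samples)
    gains = np.asarray(gains, dtype=np.float64)      # shape: (n_sources,)
    # Audio is additive: the recorded mixture is the sum of the scaled sources.
    return (gains[:, None] * sources).sum(axis=0)


sample_rate = 16000                        # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate   # one second of sample times

# Two stand-in sources: a loud on-screen sound and a quiet off-screen sound.
on_screen = np.sin(2 * np.pi * 440.0 * t)
off_screen = np.sin(2 * np.pi * 1000.0 * t)
gains = np.array([1.0, 0.1])               # uneven volumes (assumed)

mixture = mix_sources([on_screen, off_screen], gains)

# Fraction of the total source energy contributed by each scaled source.
energies = np.array([
    np.sum((gains[0] * on_screen) ** 2),
    np.sum((gains[1] * off_screen) ** 2),
])
energy_share = energies / energies.sum()
print(energy_share)
```

In this toy mixture the quieter source accounts for only a small fraction of the total energy, which suggests why a single representation of the mixed signal could be dominated by the louder source.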