Paper Title
Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Paper Authors
Paper Abstract
Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
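For illustration, a minimal sketch of what such a late fusion could look like, assuming each candidate face track already has aligned posterior scores from the two models. The weighted-average form, the function name late_fusion_scores, and the weight alpha are assumptions for this sketch; the paper's abstract only states that a simple late fusion is used.

import numpy as np

def late_fusion_scores(av_activity_scores, identity_scores, alpha=0.5):
    # av_activity_scores: per-face-track posteriors from an audio-visual activity model
    # identity_scores: per-face-track posteriors from a cross-modal identity association model
    # alpha: hypothetical fusion weight between the two score streams
    av = np.asarray(av_activity_scores, dtype=float)
    ident = np.asarray(identity_scores, dtype=float)
    return alpha * av + (1.0 - alpha) * ident

# Example: three face tracks in a shot; the fused score selects the active speaker.
av = [0.9, 0.2, 0.4]      # audio-visual activity scores
ident = [0.7, 0.6, 0.1]   # cross-modal identity association scores
fused = late_fusion_scores(av, ident, alpha=0.5)
print(fused, "predicted active speaker:", int(np.argmax(fused)))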