Paper Title
Bio-Inspired Modality Fusion for Active Speaker Detection
Paper Authors
Paper Abstract
Human beings have developed remarkable abilities to integrate information from various sensory sources by exploiting their inherent complementarity. Perceptual capabilities are thereby heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation within a panoply of sound signals. This fusion ability is also key to refining the perception of sound-source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has identified the superior colliculus region of the brain as the one responsible for this modality fusion, and a handful of biological models have been proposed to describe its underlying neurophysiological process. Drawing inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach first routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer inspired by the superior colliculus, whose topological structure emulates the spatial cross-mapping of neurons over unimodal perceptual fields. The validation process employed two publicly available datasets, with the achieved results confirming and greatly surpassing initial expectations.
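The abstract describes two unimodal networks whose embeddings are combined by a fusion layer that cross-maps unimodal perceptual fields. The following is only a minimal illustrative sketch of that idea, not the paper's actual architecture: the array sizes, the outer-product cross-mapping, and the diagonal read-out are all assumptions chosen to make the spatial-alignment intuition concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unimodal embeddings (sizes and names assumed, not from the paper):
# each is treated as a 1-D "perceptual field" over N candidate spatial positions.
N = 8
audio_field = rng.random(N)   # stand-in for the audio network's embedding
visual_field = rng.random(N)  # stand-in for the visual network's embedding

# Cross-map the two fields: unit (i, j) pairs audio position i with visual
# position j, loosely imitating the spatial neuron cross-mapping attributed
# to the superior colliculus.
cross_map = np.outer(audio_field, visual_field)  # shape (N, N)

# Multimodal response per position: read out the diagonal, where the audio
# and visual receptive fields coincide spatially.
fused = np.diag(cross_map)

# The most active position would be the predicted active-speaker location.
speaker_idx = int(np.argmax(fused))
print(cross_map.shape, speaker_idx)
```

In a trained model the cross-mapping would of course be learned rather than a fixed outer product; the sketch only shows why spatially aligned audio and visual activity reinforce each other while misaligned activity does not.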