Paper Title
Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks
Paper Authors
Paper Abstract
Emulating the human ability to solve the cocktail party problem, i.e., focus on a source of interest in a complex acoustic scene, is a long-standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX, understood to include ambient noise and natural sound events). We benchmark the performance of several deep learning-based source separation models on this task and evaluate them with respect to simple objective measures such as signal-to-distortion ratio (SDR) as well as objective metrics that better correlate with human perception. Furthermore, we thoroughly evaluate how source separation can influence downstream transcription tasks. First, we investigate the task of activity detection on the three sources as a way to both further improve source separation and perform transcription. We formulate the transcription tasks as speech recognition for speech and audio tagging for music and SFX. We observe that, while the use of source separation estimates improves transcription performance in comparison to the original soundtrack, performance is still sub-optimal due to artifacts introduced by the separation process. Therefore, we thoroughly investigate how remixing of the three separated source stems at various relative levels can reduce artifacts and consequently improve the transcription performance. We find that remixing music and SFX interferences at a target SNR of 17.5 dB reduces speech recognition word error rate, and a similar impact from remixing is observed for tagging music and SFX content.
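To make the remixing idea concrete, the sketch below shows one plausible way to rescale the separated music and SFX stems so that their sum sits at a chosen SNR (e.g., 17.5 dB) relative to the separated speech before passing the remix to a recognizer. This is an illustrative assumption, not the paper's implementation; the function name, power-based gain computation, and dummy signals are all hypothetical.

```python
# Minimal sketch (assumed, not from the paper): remix separated stems so the
# music + SFX interference lies target_snr_db below the speech estimate.
import numpy as np

def remix_for_asr(speech, music, sfx, target_snr_db=17.5, eps=1e-8):
    """Scale the combined music/SFX interference to the target SNR
    relative to the speech estimate, then add it back to the speech."""
    interference = music + sfx
    speech_power = np.mean(speech ** 2) + eps
    interference_power = np.mean(interference ** 2) + eps
    # SNR of the stems as separated, before any rescaling.
    current_snr_db = 10.0 * np.log10(speech_power / interference_power)
    # Amplitude gain that moves the interference to the desired SNR.
    gain = 10.0 ** ((current_snr_db - target_snr_db) / 20.0)
    return speech + gain * interference

# Example usage with dummy mono stems (one second at 16 kHz).
rng = np.random.default_rng(0)
speech_est = rng.standard_normal(16000)
music_est = 0.3 * rng.standard_normal(16000)
sfx_est = 0.2 * rng.standard_normal(16000)
remixed = remix_for_asr(speech_est, music_est, sfx_est, target_snr_db=17.5)
```

The intent, as described in the abstract, is that reintroducing the non-speech stems at a controlled level masks separation artifacts while keeping the interference low enough not to hurt recognition or tagging.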