Paper Title

A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

Authors

Otavio Braga, Olivier Siohan

Abstract

Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that an end-to-end model performs at least as well as a considerably larger two-step system that utilizes a hard decision boundary under various noise conditions and number of parallel face tracks.
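To make the selection mechanism concrete, below is a minimal sketch of attention-based soft speaker selection of the kind the abstract describes: the audio embedding acts as the query over per-face-track visual embeddings, and the resulting attention weights double as a soft active-speaker decision, in contrast to a two-step system's hard decision boundary. This is not the paper's actual architecture; the function names, dimensions, and the use of scaled dot-product scoring are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_speaker_selection(audio_query, face_keys, face_values):
    """Attention-based soft selection over parallel face tracks (illustrative sketch).

    audio_query: (d,)           audio embedding used as the attention query.
    face_keys:   (n_tracks, d)  per-face-track key embeddings.
    face_values: (n_tracks, dv) per-face-track visual features.

    Returns the attention-weighted visual feature and the weights; the
    weights can be read as a soft active-speaker posterior, and the whole
    operation stays fully differentiable (no hard decision boundary).
    """
    d = audio_query.shape[-1]
    scores = face_keys @ audio_query / np.sqrt(d)  # (n_tracks,)
    weights = softmax(scores)                      # soft selection over face tracks
    selected = weights @ face_values               # (dv,) blended visual feature
    return selected, weights

# Toy usage: 3 parallel face tracks, hypothetical embedding sizes.
rng = np.random.default_rng(0)
q = rng.normal(size=16)         # audio query
K = rng.normal(size=(3, 16))    # one key per face track
V = rng.normal(size=(3, 32))    # one visual feature per face track
feat, w = soft_speaker_selection(q, K, V)
print(w)  # attention over face tracks; argmax would recover a hard speaker choice
```

In a trained end-to-end model, the blended feature would feed the recognizer directly, while the weights can be evaluated separately against active-speaker labels, which is how an attention layer can be scored on the speaker selection task without ever being trained on it explicitly.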
