Streaming Noise Context Aware Enhancement for Automatic Speech Recognition in Multi-Talker Environments

Paper Title

Streaming Noise Context Aware Enhancement For Automatic Speech Recognition in Multi-Talker Environments

Authors

Joe Caroselli, Arun Narayanan, Yiteng Huang

Abstract

One of the most challenging scenarios for smart speakers is the multi-talker case, in which target speech from the desired speaker is mixed with interfering speech from one or more other speakers. A smart assistant needs to determine which voice to recognize and which to ignore, and it needs to do so in a streaming, low-latency manner. This work presents two multi-microphone speech enhancement algorithms targeted at this scenario. Targeting on-device use cases, we assume that the algorithm has access to the signal before the hotword, which is referred to as the noise context. The first is the Context Aware Beamformer, which uses the noise context and the detected hotword to determine how to target the desired speaker. The second is an adaptive noise cancellation algorithm called Speech Cleaner, which trains a filter using the noise context. The two algorithms are shown to be complementary in the signal-to-noise ratio (SNR) conditions under which they work well. We also propose an algorithm that selects which one to use based on the estimated SNR. When using 3 microphone channels, the final system achieves a relative word error rate reduction of 55% at -12 dB and 43% at 12 dB.
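
The SNR-based selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how SNR is estimated, what threshold is used, or which algorithm is preferred in which SNR range, so everything below is an assumption made for illustration: the function names, the power-subtraction SNR estimate, the 0 dB threshold, and the mapping of low SNR to Speech Cleaner are all hypothetical and not taken from the paper.

```python
import math


def mean_power(samples):
    """Mean power (average squared amplitude) of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(noise_context, hotword_segment, floor=1e-12):
    """Rough SNR estimate in dB (hypothetical, not the paper's estimator).

    Treats the audio before the hotword (the noise context) as noise only,
    and assumes speech and noise are uncorrelated, so that
    speech power ~= power during the hotword - power of the noise context.
    """
    noise_power = max(mean_power(noise_context), floor)
    speech_power = max(mean_power(hotword_segment) - noise_power, floor)
    return 10.0 * math.log10(speech_power / noise_power)


def select_enhancer(snr_db, threshold_db=0.0,
                    low_snr_choice="speech_cleaner",
                    high_snr_choice="context_aware_beamformer"):
    """Pick an enhancer label from the estimated SNR.

    The threshold and the low/high assignment are placeholders; the
    abstract only states that the two algorithms are complementary.
    """
    return low_snr_choice if snr_db < threshold_db else high_snr_choice


# Toy single-channel signals: quiet noise context, louder hotword segment.
noise_context = [0.1, -0.1] * 100
hotword_segment = [1.0, -1.0] * 100

snr_db = estimate_snr_db(noise_context, hotword_segment)
choice = select_enhancer(snr_db)
```

In practice the estimate would be computed per utterance from multi-channel audio, and the threshold would be tuned on held-out data; the sketch only shows the shape of the decision.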
