Lattice-based Improvements for Voice Triggering Using Graph Neural Networks

Paper title

Lattice-based Improvements for Voice Triggering Using Graph Neural Networks

Authors

Pranay Dighe, Saurabh Adya, Nuoyu Li, Srikanth Vishnubhotla, Devang Naik, Adithya Sagar, Ying Ma, Stephen Pulman, Jason Williams

Abstract

Voice-triggered smart assistants often rely on detecting a trigger phrase before they start listening for the user request. Mitigating false triggers is an important aspect of building a privacy-centric, non-intrusive smart assistant. In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices with graph neural networks (GNNs). The proposed approach exploits the fact that the decoding lattice of falsely triggered audio exhibits uncertainty, in the form of many alternative paths and unexpected words on the lattice arcs, compared to the lattice of correctly triggered audio. A pure trigger-phrase detector model does not fully utilize the intent of the user speech, whereas by using the complete decoding lattice of the user audio, we can effectively mitigate speech not intended for the smart assistant. We deploy two variants of GNNs in this paper, based on 1) graph convolution layers and 2) a self-attention mechanism, respectively. Our experiments demonstrate that GNNs are highly accurate on the FTM task, mitigating ~87% of false triggers at a 99% true positive rate (TPR). Furthermore, the proposed models are fast to train and efficient in parameter requirements.
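
To make the graph-convolution variant more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes lattice arcs are treated as graph nodes, that each node aggregates the mean of its own and its successors' features, and that a mean-pooled readout gives a single score per utterance. The names `graph_conv` and `trigger_score`, the toy lattice, the feature values, and the random weights are all hypothetical.

```python
import math
import random


def graph_conv(adj, feats, weight):
    """One graph-convolution step (assumed form): average each node's features
    with those of its successors (self-loop included), project, apply ReLU."""
    n = len(adj)
    out = []
    for i in range(n):
        # Neighbourhood of node i: nodes j with an edge i -> j, plus i itself.
        nbrs = [j for j in range(n) if adj[i][j] or j == i]
        mean = [sum(feats[j][k] for j in nbrs) / len(nbrs)
                for k in range(len(feats[0]))]
        # Linear projection followed by ReLU.
        out.append([max(0.0, sum(mean[k] * weight[k][h] for k in range(len(mean))))
                    for h in range(len(weight[0]))])
    return out


def trigger_score(adj, feats, weight, readout):
    """Mean-pool node embeddings and squash the result to a value in (0, 1)."""
    hidden = graph_conv(adj, feats, weight)
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    logit = sum(p * r for p, r in zip(pooled, readout))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical toy lattice: 4 arcs as nodes; adj[i][j] = 1 if arc j follows arc i.
adj = [[0, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]
# Hypothetical per-arc features, e.g. (acoustic score, LM score, is-trigger-word).
feats = [[0.9, 0.8, 1.0],
         [0.4, 0.3, 0.0],
         [0.5, 0.2, 0.0],
         [0.7, 0.6, 0.0]]

random.seed(0)
weight = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # 3 -> 2
readout = [random.uniform(-1, 1) for _ in range(2)]

score = trigger_score(adj, feats, weight, readout)
print(f"score = {score:.3f}")
```

In a trained model the weights would be learned from labeled true and false triggers, and the score would be thresholded to decide whether to mitigate the trigger.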
