Paper Title
Contextual Adapters for Personalized Speech Recognition in Neural Transducers
Paper Authors
Paper Abstract
Personal rare word recognition in end-to-end Automatic Speech Recognition (E2E ASR) models is a challenge due to the lack of training data. A standard way to address this issue is with shallow fusion methods at inference time. However, due to their dependence on external language models and deterministic weight boosting, their performance is limited. In this paper, we propose training neural contextual adapters for personalization in neural transducer-based ASR models. Our approach not only biases recognition towards user-defined words, but also has the flexibility to work with pretrained ASR models. Using an in-house dataset, we demonstrate that contextual adapters can be applied to any general-purpose pretrained ASR model to improve personalization. Our method outperforms shallow fusion, while retaining the functionality of the pretrained model by not altering any of its weights. We further show that adapter-style training is superior to full fine-tuning of the ASR models on datasets with user-defined content.
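As a rough illustration of the idea described in the abstract, the sketch below shows a contextual biasing adapter: user-defined words (a personal catalog) are embedded, and the frozen encoder's hidden states attend over them via cross-attention, producing an additive bias while the pretrained transducer weights stay untouched. This is a minimal sketch under assumed dimensions and module names (e.g., `ContextualAdapter`, mean-pooled catalog embeddings), not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn


class ContextualAdapter(nn.Module):
    """Minimal sketch of a contextual biasing adapter for a frozen ASR encoder.

    User-defined catalog entries are embedded and attended to from the encoder
    outputs; only this adapter is trained, the pretrained model is frozen.
    All names/dimensions here are illustrative assumptions.
    """

    def __init__(self, enc_dim: int, catalog_dim: int, vocab_size: int):
        super().__init__()
        # Catalog encoder: embeds subword tokens of each user-defined word.
        self.catalog_embed = nn.Embedding(vocab_size, catalog_dim, padding_idx=0)
        self.query_proj = nn.Linear(enc_dim, catalog_dim)
        self.out_proj = nn.Linear(catalog_dim, enc_dim)

    def forward(self, enc_out: torch.Tensor, catalog_tokens: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, time, enc_dim) hidden states from the frozen encoder.
        # catalog_tokens: (batch, num_entries, max_subwords) token ids, 0 = padding.
        catalog = self.catalog_embed(catalog_tokens).mean(dim=2)   # (B, N, catalog_dim)
        queries = self.query_proj(enc_out)                         # (B, T, catalog_dim)
        scores = torch.matmul(queries, catalog.transpose(1, 2))    # (B, T, N)
        attn = torch.softmax(scores, dim=-1)
        bias = self.out_proj(torch.matmul(attn, catalog))          # (B, T, enc_dim)
        # Residual biasing: the pretrained encoder output is kept intact and
        # only an additive, learned correction is applied.
        return enc_out + bias
```

In adapter-style training, one would freeze every parameter of the pretrained transducer (e.g., `for p in asr_model.parameters(): p.requires_grad = False`) and optimize only the adapter on data containing user-defined content, which is the contrast with full fine-tuning drawn in the abstract.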