Paper Title

Training Keyword Spotters with Limited and Synthesized Speech Data

Authors

James Lin, Kevin Kilgour, Dominik Roblek, Matthew Sharifi

Abstract

With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts of the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small, spoken term detection models of around 400k parameters. Instead of training such models directly on the audio or low-level features such as MFCCs, we use a pre-trained speech embedding model trained to extract useful features for keyword spotting models. Using this speech embedding, we show that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples. We also show that a model without our speech embeddings would need to be trained on over 4000 real examples to reach the same accuracy.
