Paper Title


Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer

Authors

Švec, Jan, Lehečka, Jan, Šmídl, Luboš

Abstract


In recent years, standard hybrid DNN-HMM speech recognizers have been outperformed by end-to-end speech recognition systems. One of the most promising approaches is the grapheme Wav2Vec 2.0 model, which combines self-supervised pretraining with transfer learning to fine-tune the speech recognizer. Since it lacks a pronunciation vocabulary and language model, the approach is suitable for tasks where obtaining such models is difficult or almost impossible. In this paper, we use the Wav2Vec speech recognizer for spoken term detection over a large set of spoken documents. The method employs a deep LSTM network which maps the recognized hypothesis and the searched term into a shared pronunciation embedding space in which the term occurrences and the assigned scores are easily computed. The paper describes a bootstrapping approach that allows the transfer of the knowledge contained in the traditional pronunciation vocabulary of a DNN-HMM hybrid ASR into the context of the grapheme-based Wav2Vec. The proposed method outperforms the previously published system, based on the combination of a DNN-HMM hybrid ASR and a phoneme recognizer, by a large margin on the MALACH data in both English and Czech.
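The scoring step in the shared pronunciation embedding space can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding dimension, the random vectors standing in for the deep-LSTM outputs, and the cosine-similarity scoring are all illustrative assumptions, showing only how term occurrences and scores might be computed once both the hypothesis positions and the term live in one embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: in the paper, a deep LSTM maps the recognized
# hypothesis and the searched term into a shared pronunciation embedding
# space. Here, random vectors stand in for those LSTM outputs.
embed_dim = 8
hyp_frames = rng.normal(size=(20, embed_dim))  # per-position hypothesis embeddings
term_vec = rng.normal(size=(embed_dim,))       # embedding of the searched term

# Cosine similarity between the term and every hypothesis position.
hyp_norm = hyp_frames / np.linalg.norm(hyp_frames, axis=1, keepdims=True)
term_norm = term_vec / np.linalg.norm(term_vec)
scores = hyp_norm @ term_norm

# A putative occurrence is the best-matching position; its score can be
# thresholded to decide whether the term is detected at that position.
best_pos = int(np.argmax(scores))
best_score = float(scores[best_pos])
print(best_pos, best_score)
```

In this shared-space formulation, detection reduces to nearest-neighbor style similarity search over hypothesis positions, which is why the abstract notes that term occurrences and scores are "easily computed."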
