Title

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

Authors

Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li, Yao Qian, Furu Wei

Abstract

This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language modeling on the encoder output, like the HuBERT model, while the other lets the decoder learn to reconstruct pseudo codes autoregressively instead of generating textual transcripts. In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text. Comprehensive experiments on the LibriSpeech corpus show that the proposed Speech2C reduces the word error rate (WER) by 19.2% relative over the method without decoder pre-training, and also significantly outperforms the state-of-the-art wav2vec 2.0 and HuBERT on the 10h and 100h fine-tuning subsets. We release our code and model at https://github.com/microsoft/SpeechT5/tree/main/Speech2C.
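
The abstract outlines two pre-training tasks over pseudo codes: masked prediction on the encoder output and autoregressive reconstruction by the decoder. Below is a minimal PyTorch sketch of how such a multi-task loss could be wired up; it is not the authors' released implementation (that is at the repository linked above). The `Speech2CSketch` class name, the model sizes, the pad/bos id convention, and the equal loss weighting are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class Speech2CSketch(nn.Module):
    """Toy encoder-decoder with the two pseudo-code pre-training losses."""

    def __init__(self, feat_dim=80, d_model=256, n_codes=500, pad_id=0, bos_id=1):
        super().__init__()
        self.bos_id = bos_id
        self.encoder_in = nn.Linear(feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=4, num_decoder_layers=2,
            batch_first=True,
        )
        # Vocabulary = pseudo codes plus pad/bos; code ids are assumed to be
        # offset so that 0 and 1 are reserved.
        self.code_embed = nn.Embedding(n_codes + 2, d_model, padding_idx=pad_id)
        self.enc_head = nn.Linear(d_model, n_codes + 2)  # task 1 head
        self.dec_head = nn.Linear(d_model, n_codes + 2)  # task 2 head

    def forward(self, feats, codes, mask):
        # feats: (B, T, feat_dim) speech features, codes: (B, T) pseudo codes,
        # mask: (B, T) bool, True where input frames were masked.
        enc_out = self.transformer.encoder(self.encoder_in(feats))

        # Task 1: HuBERT-style masked prediction of pseudo codes on the
        # encoder output, scored only at the masked positions.
        enc_loss = nn.functional.cross_entropy(
            self.enc_head(enc_out)[mask], codes[mask]
        )

        # Task 2: teacher-forced autoregressive reconstruction of the full
        # pseudo-code sequence by the decoder.
        bos = torch.full((codes.size(0), 1), self.bos_id,
                         dtype=torch.long, device=codes.device)
        dec_in = self.code_embed(torch.cat([bos, codes[:, :-1]], dim=1))
        causal = self.transformer.generate_square_subsequent_mask(
            dec_in.size(1)).to(dec_in.device)
        dec_out = self.transformer.decoder(dec_in, enc_out, tgt_mask=causal)
        dec_loss = nn.functional.cross_entropy(
            self.dec_head(dec_out).transpose(1, 2), codes
        )

        # Equal weighting of the two losses is an assumption of this sketch.
        return enc_loss + dec_loss


if __name__ == "__main__":
    model = Speech2CSketch()
    feats = torch.randn(2, 100, 80)          # dummy speech features
    codes = torch.randint(2, 502, (2, 100))  # dummy offset pseudo codes
    mask = torch.rand(2, 100) < 0.5          # random masking pattern
    print(model(feats, codes, mask).item())
```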
