Paper Title

Two-Pass End-to-End ASR Model Compression

Paper Authors

Nauman Dawalatabad, Tushar Vatsal, Ashutosh Gupta, Sungsoo Kim, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim

Paper Abstract

Speech recognition on smart devices is challenging owing to the small memory footprint. Hence, small-size ASR models are desirable. With the use of popular transducer-based models, it has become possible to practically deploy streaming speech recognition models on small devices [1]. Recently, the two-pass model [2] combining RNN-T and LAS modules has shown exceptional performance for streaming on-device speech recognition. In this work, we propose a simple and effective approach to reduce the size of the two-pass model for memory-constrained devices. We employ a popular knowledge distillation approach in three stages using the Teacher-Student training technique. In the first stage, we use a trained RNN-T model as a teacher model and perform knowledge distillation to train the student RNN-T model. The second stage uses the shared encoder and trains a LAS rescorer for the student model using the trained RNN-T+LAS teacher model. Finally, we perform deep-finetuning for the student model with a shared RNN-T encoder, RNN-T decoder, and LAS rescorer. Our experimental results on the standard LibriSpeech dataset show that our system can achieve a high compression rate of 55% without significant degradation in WER compared to the two-pass teacher model.
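To make the first distillation stage more concrete, below is a minimal, hedged sketch of Teacher-Student training in which a frozen teacher's output distribution supervises a smaller student via a KL-divergence loss. The abstract does not specify the framework or the exact distillation objective; PyTorch, the module names (`teacher`, `student`), the `temperature` value, and the argument signatures are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative stage-1 Teacher-Student distillation sketch (assumptions noted above).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the teacher's and student's output distributions."""
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    # 'batchmean' matches the mathematical definition of KL divergence.
    return F.kl_div(s_logprob, t_prob, reduction="batchmean") * temperature ** 2

def distill_step(teacher, student, optimizer, features, feat_lens, labels, label_lens):
    """One training step: the teacher is frozen, only the student is updated."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(features, feat_lens, labels, label_lens)
    s_logits = student(features, feat_lens, labels, label_lens)
    loss = distillation_loss(s_logits, t_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the same spirit, stage two would reuse the distilled (shared) student encoder and train the student LAS rescorer against the teacher's RNN-T+LAS outputs, and stage three would jointly fine-tune the shared encoder, RNN-T decoder, and LAS rescorer.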
