SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training

Paper Title

SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training

Authors

Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu

Abstract

We introduce a new approach for speech pre-training named SPIRAL, which works by learning denoising representations of perturbed data in a teacher-student framework. Specifically, given a speech utterance, we first feed the utterance to a teacher network to obtain the corresponding representation. Then the same utterance is perturbed and fed to a student network. The student network is trained to output a representation resembling that of the teacher. At the same time, the teacher network is updated as a moving average of the student's weights over training steps. To prevent representation collapse, we apply an in-utterance contrastive loss as the pre-training objective and impose position randomization on the input to the teacher. SPIRAL achieves competitive or better results compared to the state-of-the-art speech pre-training method wav2vec 2.0, with a significant reduction in training cost (80% for the BASE model, 65% for the LARGE model). Furthermore, we address the problem of noise robustness, which is critical to real-world speech applications. We propose multi-condition pre-training, which perturbs the student's input with various types of additive noise. We demonstrate that multi-condition pre-trained SPIRAL models are more robust to noisy speech (9.0% - 13.3% relative word error rate reduction on real noisy test data) than models that apply multi-condition training solely in the fine-tuning stage. Source code is available at https://github.com/huawei-noah/Speech-Backbones/tree/main/SPIRAL.
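
The two mechanisms named in the abstract, the moving-average teacher update and the in-utterance contrastive loss, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The function names, the decay value, the temperature, the use of cosine similarity, and the Gaussian perturbation are all assumptions made for illustration, and position randomization on the teacher input is omitted. Frames are plain lists of floats standing in for network outputs.

```python
import math
import random


def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight toward the student weight (moving average).

    `decay` is a hypothetical value, not one taken from the paper.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


def in_utterance_contrastive_loss(student_frames, teacher_frames, temperature=0.1):
    """InfoNCE-style loss (an assumed form of the contrastive objective).

    For student frame i, the teacher frame at position i is the positive;
    the other teacher frames of the same utterance act as negatives.
    """
    total = 0.0
    for i, s in enumerate(student_frames):
        logits = [cosine(s, t) / temperature for t in teacher_frames]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(student_frames)


def perturb(frames, scale, rng):
    """Add Gaussian noise to every value (a stand-in for input perturbation)."""
    return [[x + rng.gauss(0.0, scale) for x in frame] for frame in frames]


# Toy "utterance": four orthogonal teacher frames and a noisy student copy.
rng = random.Random(0)
teacher_frames = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
student_frames = perturb(teacher_frames, 0.05, rng)

aligned_loss = in_utterance_contrastive_loss(student_frames, teacher_frames)
misaligned_loss = in_utterance_contrastive_loss(student_frames, teacher_frames[::-1])
print(aligned_loss < misaligned_loss)
```

In this toy setup the loss is low when each student frame matches the teacher frame at the same position and high when the teacher frames are reversed, which is the behavior the training objective rewards. In an actual training loop, `ema_update` would be applied to the teacher's parameters after each student optimization step.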
