Paper Title
Unified Speech-Text Pre-training for Speech Translation and Recognition
Paper Authors
Paper Abstract
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross-modality learning. A self-supervised speech subtask leverages unlabelled speech data, and a (self-)supervised text-to-text subtask makes use of abundant text training data. Two auxiliary supervised speech tasks are included to unify the speech and text modeling spaces. Our contribution lies in integrating linguistic information from the text corpus into the speech pre-training. Detailed analysis reveals learning interference among the subtasks. Two pre-training configurations, one for speech translation and one for speech recognition, are presented to alleviate subtask interference. Our experiments show the proposed method can effectively fuse speech and text information into one model. It achieves improvements of 1.7 to 2.3 BLEU over the state of the art on the MuST-C speech translation dataset and comparable WERs to wav2vec 2.0 on the LibriSpeech speech recognition task.
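The abstract describes a joint objective in which four subtasks are trained together. A minimal sketch of such a setup is a weighted sum of per-subtask losses computed at each training step. This is an illustration only: the function name, subtask names, and uniform weighting below are assumptions, and the paper's actual loss formulation is not reproduced here.

```python
# Hypothetical sketch: combine per-subtask losses into one joint
# pre-training loss. Subtask names follow the abstract; the loss
# values and uniform weights are purely illustrative.

def joint_pretraining_loss(losses, weights):
    """Return the weighted sum of per-subtask losses for one step."""
    assert losses.keys() == weights.keys()
    return sum(weights[name] * losses[name] for name in losses)

# Example: the four subtasks named in the abstract.
subtask_losses = {
    "self_supervised_speech": 2.4,  # learned from unlabelled speech data
    "text_to_text": 1.1,            # (self-)supervised text-to-text subtask
    "aux_speech_1": 0.8,            # auxiliary supervised speech task
    "aux_speech_2": 0.9,            # auxiliary supervised speech task
}
weights = {name: 1.0 for name in subtask_losses}  # uniform weighting as a default

total = joint_pretraining_loss(subtask_losses, weights)
```

The paper's analysis of learning interference suggests that, in practice, the subtask mix (which subtasks are active, and with what weights) would differ between the translation and recognition pre-training configurations.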