Paper Title
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
Paper Authors
Paper Abstract
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
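To make the two-pass pipeline described in the abstract concrete (speech encoder → first-pass text decoder → second-pass unit decoder → unit vocoder), here is a minimal schematic sketch in Python. Every function here is a toy, hypothetical stand-in: the real UnitY components are Transformer encoders/decoders trained end-to-end, and none of these names come from the paper or its codebase.

```python
# Schematic sketch of UnitY-style two-pass direct S2ST inference.
# All components are illustrative stand-ins, not the actual implementation.

def encode_speech(waveform):
    # Speech encoder: map raw audio frames to hidden states
    # (here, trivially, one scalar per frame).
    return [sum(frame) / len(frame) for frame in waveform]

def first_pass_text_decoder(hidden):
    # First pass: generate target-language subword tokens.
    # (In UnitY this decoder is pre-trained with denoising auto-encoding.)
    return [f"sub{i}" for i, _ in enumerate(hidden)]

def second_pass_unit_decoder(hidden, subwords):
    # Second pass: condition on encoder states and first-pass text
    # to predict discrete acoustic units (indices into a learned codebook).
    return [(i * 7) % 100 for i, _ in enumerate(subwords)]

def unit_vocoder(units):
    # A unit-based vocoder converts discrete units back into a waveform.
    return [u / 100.0 for u in units]

def translate(waveform):
    hidden = encode_speech(waveform)
    subwords = first_pass_text_decoder(hidden)   # intermediate textual output
    units = second_pass_unit_decoder(hidden, subwords)
    return subwords, unit_vocoder(units)

text, audio = translate([[0.1, 0.2], [0.3, 0.1]])
```

The key structural point the sketch mirrors is that the second pass consumes both the encoder states and the first-pass text, so the text and unit outputs stay consistent; predicting short discrete-unit sequences instead of spectrogram frames is what yields the reported decoding speed-up.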