Paper Title
Pretraining Techniques for Sequence-to-Sequence Voice Conversion
Paper Authors
Paper Abstract
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation in the converted speech, which keeps them far from practical use. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks for which large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech. We apply this technique to recurrent neural network (RNN)-based and Transformer-based models, and through systematic experiments we demonstrate the effectiveness of the pretraining scheme and the superiority of Transformer-based models over RNN-based models in terms of intelligibility, naturalness, and similarity.
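To make the initialization idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: a toy Transformer-based seq2seq VC model whose encoder is initialized from a pretrained ASR checkpoint and whose decoder is initialized from a pretrained TTS checkpoint. The module layout, checkpoint paths, and checkpoint keys ("encoder"/"decoder") are assumptions for illustration only.

```python
# Hedged sketch of pretraining-based initialization for seq2seq VC.
# Module names, checkpoint files, and state-dict keys are hypothetical.
import torch
import torch.nn as nn


class Seq2SeqVC(nn.Module):
    """Toy seq2seq VC model: Transformer encoder-decoder over mel spectrograms."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.prenet = nn.Linear(n_mels, d_model)    # mel-frame embedding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.postnet = nn.Linear(d_model, n_mels)   # target-mel projection

    def forward(self, src_mel, tgt_mel):
        src = self.prenet(src_mel)                  # (batch, src_frames, d_model)
        tgt = self.prenet(tgt_mel)                  # (batch, tgt_frames, d_model)
        out = self.transformer(src, tgt)            # (batch, tgt_frames, d_model)
        return self.postnet(out)                    # (batch, tgt_frames, n_mels)


def init_from_pretrained(vc_model: Seq2SeqVC,
                         asr_ckpt: str, tts_ckpt: str) -> Seq2SeqVC:
    """Copy encoder weights from a pretrained ASR model and decoder weights
    from a pretrained TTS model, assuming the (hypothetical) checkpoints store
    submodule state dicts whose shapes match the VC model."""
    asr_state = torch.load(asr_ckpt, map_location="cpu")
    tts_state = torch.load(tts_ckpt, map_location="cpu")
    vc_model.transformer.encoder.load_state_dict(asr_state["encoder"])
    vc_model.transformer.decoder.load_state_dict(tts_state["decoder"])
    return vc_model


if __name__ == "__main__":
    model = Seq2SeqVC()
    # With real checkpoints available, one would warm-start before VC training:
    # model = init_from_pretrained(model, "asr.pt", "tts.pt")
    src = torch.randn(2, 120, 80)    # (batch, source frames, mel bins)
    tgt = torch.randn(2, 100, 80)    # (batch, target frames, mel bins)
    print(model(src, tgt).shape)     # torch.Size([2, 100, 80])
```

After this warm start, the whole model would be fine-tuned on the (small) parallel VC corpus, which is the transfer-learning step the abstract argues stabilizes training and improves intelligibility.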