StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

Paper Title

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

Authors

Li, Yinghao Aaron; Han, Cong; Mesgarani, Nima

Abstract

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach yields disentangled speech representations without needing the input text. Subjective evaluation shows that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
