NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis

Paper title

NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis

Authors

Hyeong-Seok Choi, Jinhyeok Yang, Juheon Lee, Hyeongju Kim

Abstract

Various applications of voice synthesis have been developed independently, despite the fact that they all generate "voice" as output. In addition, most voice synthesis models still require a large amount of audio data paired with annotated labels (e.g., text transcription and music score) for training. To this end, we propose a unified framework for synthesizing and manipulating voice signals from analysis features, dubbed NANSY++. The backbone network of NANSY++ is trained in a self-supervised manner that does not require any annotations paired with audio. After training the backbone network, we efficiently tackle four voice applications (voice conversion, text-to-speech, singing voice synthesis, and voice designing) by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high-quality synthesis. Audio samples: tinyurl.com/8tnsy3uc.
