Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder

Paper Title

Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder

Authors

Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

Abstract

Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into the parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, its high-temporal-resolution inputs result in high computation costs. Although the HiFi-GAN vocoder achieves fast high-fidelity voice generation thanks to its efficient upsampling-based generator architecture, its pitch controllability is severely limited. To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the source-filter theory into HiFi-GAN by hierarchically conditioning the resonance filtering network on well-estimated source excitation information. According to the experimental results, our proposed method outperforms HiFi-GAN and uSFGAN on singing voice generation in both voice quality and synthesis speed on a single CPU. Furthermore, unlike the uSFGAN vocoder, the proposed method can be easily adopted in real-time applications and integrated into end-to-end systems.
