Paper Title
HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
Authors
Abstract
High-fidelity singing voices usually require a higher sampling rate (e.g., 48 kHz) to convey expression and emotion. However, a higher sampling rate results in a wider frequency band and longer waveform sequences, which poses challenges for singing voice synthesis (SVS) in both the frequency and time domains. Conventional SVS systems that adopt lower sampling rates cannot adequately address these challenges. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice. HiFiSinger consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder to ensure fast training and inference as well as high voice quality. To tackle the difficulty of singing modeling caused by the high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and the vocoder. Specifically, 1) to handle the larger range of frequencies caused by the higher sampling rate, we propose a novel sub-frequency GAN (SF-GAN) for mel-spectrogram generation, which splits the full 80-dimensional mel-frequency range into multiple sub-bands and models each sub-band with a separate discriminator. 2) To model the longer waveform sequences caused by the higher sampling rate, we propose a multi-length GAN (ML-GAN) for waveform generation, which models waveform sequences of different lengths with separate discriminators. 3) We also introduce several additional designs and findings in HiFiSinger that are crucial for high-fidelity voices, such as adding F0 (pitch) and V/UV (voiced/unvoiced flag) as acoustic features, choosing an appropriate window/hop size for the mel-spectrogram, and increasing the receptive field of the vocoder for long vowel modeling. Experimental results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality: MOS gains of 0.32 and 0.44 over the 48 kHz and 24 kHz baselines, respectively, and a 0.83 MOS gain over previous SVS systems.
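The two multi-scale ideas in the abstract can be illustrated with a small data-preparation sketch. It only shows how inputs for separate discriminators might be carved out: SF-GAN style frequency sub-bands from an 80-bin mel-spectrogram, and ML-GAN style waveform crops of several lengths. The band boundaries, crop lengths, and function names below are assumptions made for illustration; the abstract does not specify them, and no discriminator network is implemented here.

```python
import random


def split_mel_subbands(mel, bands):
    """Slice each frame of a mel-spectrogram into frequency sub-bands.

    mel:   list of frames, each a list of mel-bin values.
    bands: list of (start, end) bin indices, end exclusive.
    Returns one list of sliced frames per band.
    """
    return [[frame[start:end] for frame in mel] for start, end in bands]


def crop_multi_length(waveform, lengths, rng):
    """Take one random contiguous crop of each requested length."""
    crops = []
    for length in lengths:
        if length > len(waveform):
            raise ValueError("crop length exceeds waveform length")
        offset = rng.randrange(len(waveform) - length + 1)
        crops.append(waveform[offset:offset + length])
    return crops


# Hypothetical settings, chosen only for this example.
N_MELS = 80                            # 80 mel bins, as stated in the abstract
BANDS = [(0, 40), (20, 60), (40, 80)]  # assumed low / mid / high sub-bands
LENGTHS = [4800, 12000, 24000]         # assumed crops: 0.1 s, 0.25 s, 0.5 s at 48 kHz

rng = random.Random(0)
mel = [[rng.random() for _ in range(N_MELS)] for _ in range(100)]  # 100 dummy frames
wave = [rng.uniform(-1.0, 1.0) for _ in range(48000)]              # 1 s of dummy audio

subbands = split_mel_subbands(mel, BANDS)     # one input per sub-band discriminator
crops = crop_multi_length(wave, LENGTHS, rng) # one input per length discriminator
```

In a full system, each element of `subbands` and `crops` would be fed to its own discriminator, so that no single discriminator has to judge the entire frequency range or the entire waveform at once. The dummy one-second waveform also makes the length issue concrete: it holds 48,000 samples at 48 kHz, twice as many as the same second at 24 kHz.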