MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Paper Title

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Authors

Shoutong Wang, Jinglin Liu, Yi Ren, Zhen Wang, Changliang Xu, Zhou Zhao

Abstract

Multi-speaker singing voice synthesis aims to generate singing voices sung by different speakers. To generalize to new speakers, previous zero-shot singing adaptation methods obtain the timbre of the target speaker as a fixed-size embedding from a single reference audio. However, they face several challenges: 1) the fixed-size speaker embedding is not powerful enough to capture the full details of the target timbre; 2) a single reference audio does not contain sufficient timbre information about the target speaker; 3) the pitch inconsistency between different speakers also degrades the generated voice. In this paper, we propose a new model called MR-SVS to tackle these problems. Specifically, we employ both a multi-reference encoder and a fixed-size encoder to encode the timbre of the target speaker from multiple reference audios. The multi-reference encoder can capture more details and variations of the target timbre. In addition, we propose a well-designed pitch shift method to address the pitch inconsistency problem. Experiments indicate that our method outperforms the baseline method in both naturalness and similarity.
