Paper Title

Residual-guided Personalized Speech Synthesis based on Face Image

Authors

Jianrong Wang, Zixuan Wang, Xiaosheng Hu, Xuewei Li, Qiang Fang, Li Liu

Abstract

Previous works derive personalized speech features by training a model on a large dataset composed of a speaker's audio. It has been reported that face information is strongly linked with the speech sound. Thus, in this work, we innovatively extract personalized speech features from human faces to synthesize personalized speech using a neural vocoder. A Face-based Residual Personalized Speech Synthesis model (FR-PSS), containing a speech encoder, a speech synthesizer, and a face encoder, is designed for PSS. In this model, by designing two speech priors, a residual-guided strategy is introduced to guide the face feature toward the true speech feature during training. Moreover, considering the error in the features' absolute values and their directional bias, we formulate a novel tri-item loss function for the face encoder. Experimental results show that the speech synthesized by our model is comparable to the personalized speech synthesized in previous works by training on large amounts of audio data.
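The abstract mentions two technical ideas without giving formulas: a residual-guided strategy (the face feature is nudged toward the true speech feature via speech priors) and a tri-item loss accounting for both absolute-value error and directional bias. A minimal sketch of how such terms could be combined is below; the abstract does not spell out the actual three items, so the specific choices here (MSE for absolute-value error, cosine distance for directional bias, an MAE term as the assumed third item, and the `guided_feature` helper) are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def guided_feature(speech_prior, face_residual):
    """Residual-guided sketch (assumption): the face encoder predicts a
    residual that is added to a fixed speech prior, so the combined
    feature approaches the true speech feature during training."""
    return speech_prior + face_residual

def tri_item_loss(pred, target, w_abs=1.0, w_dir=1.0, w_l1=1.0):
    """Hypothetical tri-item loss; the exact three terms are assumptions."""
    # Item 1: mean squared error -- penalizes absolute-value differences.
    mse = np.mean((pred - target) ** 2)
    # Item 2: cosine distance -- penalizes directional bias between features.
    cos = np.dot(pred, target) / (
        np.linalg.norm(pred) * np.linalg.norm(target) + 1e-8
    )
    cos_dist = 1.0 - cos
    # Item 3: mean absolute error -- an assumed third term.
    mae = np.mean(np.abs(pred - target))
    return w_abs * mse + w_dir * cos_dist + w_l1 * mae
```

In this sketch the loss vanishes when the predicted and true features match in both magnitude and direction, and the weights let training trade off the two kinds of error.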
