Paper Title

AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents

Authors

Yongmao Zhang, Zhichao Wang, Peiji Yang, Hongshen Sun, Zhisheng Wang, Lei Xie

Abstract

Learning accent from crowd-sourced data is a feasible way to build a target-speaker TTS system that can synthesize accented speech. Two challenging problems must be solved to get there. First, directly using the crowd-sourced data, which has poor acoustic quality, together with the target speaker data for accent transfer leads to synthetic speech with degraded quality. To mitigate this problem, we take a bottleneck (BN) feature based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module that learns accent and a BN-to-Mel (BN2Mel) module that learns speaker timbre; the neural-network-based BN feature serves as an intermediate representation that is robust to noise interference. Second, directly training T2BN on the crowd-sourced data in this two-stage system produces accented speech for the target speaker with poor prosody, because the crowd-sourced recordings are contributed by ordinary, non-professional speakers. To tackle this problem, we extend the two-stage approach to a novel three-stage approach, in which T2BN and BN2Mel are trained on the high-quality target speaker data and a new BN-to-BN (BN2BN) module is inserted between the two modules to perform accent transfer. To train the BN2BN module, parallel unaccented and accented BN features are obtained through a proposed data augmentation procedure. As a result, the proposed three-stage approach produces accented speech for the target speaker with good prosody: the prosody pattern is inherited from the professional target speaker, while accent transfer is handled by the BN2BN module. The proposed approach, named AccentSpeech, is validated on a Mandarin TTS accent transfer task.
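
The three-stage data flow described in the abstract can be sketched as a simple composition of stages. This is only an illustrative sketch, not the authors' implementation: the three functions are toy placeholders standing in for trained networks, and the dimensions, the fixed frames-per-token duration, and the `accented` switch (skipping BN2BN to get unaccented output) are assumptions made here for illustration.

```python
from typing import List, Sequence

Frame = List[float]

# All constants below are hypothetical toy values, not taken from the paper.
BN_DIM = 4            # size of one bottleneck (BN) feature frame
MEL_DIM = 3           # number of mel bins per output frame
FRAMES_PER_TOKEN = 2  # fixed duration, standing in for a real duration model


def t2bn(token_ids: Sequence[int]) -> List[Frame]:
    """Placeholder T2BN: text tokens -> frame-level BN features.

    In the abstract this stage is trained on high-quality target speaker
    data, so it carries the professional speaker's prosody.
    """
    return [[float(t)] * BN_DIM for t in token_ids for _ in range(FRAMES_PER_TOKEN)]


def bn2bn(bn: List[Frame]) -> List[Frame]:
    """Placeholder BN2BN: unaccented BN -> accented BN.

    The frame count and frame size are preserved, so timing from T2BN
    passes through unchanged.
    """
    return [[v + 0.5 for v in frame] for frame in bn]


def bn2mel(bn: List[Frame]) -> List[Frame]:
    """Placeholder BN2Mel: BN features -> mel frames in the target timbre."""
    return [[sum(frame) / BN_DIM] * MEL_DIM for frame in bn]


def synthesize(token_ids: Sequence[int], accented: bool = True) -> List[Frame]:
    """Run T2BN, optionally BN2BN, then BN2Mel."""
    bn = t2bn(token_ids)
    if accented:
        bn = bn2bn(bn)  # accent transfer happens only in this middle stage
    return bn2mel(bn)


mel = synthesize([1, 2, 3])
print(len(mel), len(mel[0]))
```

The point of the sketch is the interface: because BN2BN maps BN features to BN features of the same shape, it can be inserted between the other two stages without retraining them.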
