Paper Title
Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
Authors
Abstract
Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model. Although many approaches using deep neural networks (DNNs) have been proposed, DNNs are prone to overfitting when the amount of training data is limited. We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian kernel regressions and is thus robust to overfitting. In this framework, speaker information is fed to the duration/acoustic models using speaker codes. We also examine the use of deep Gaussian process latent variable models (DGPLVMs). In this approach, the representation of each speaker is learned simultaneously with the other model parameters, so the similarity or dissimilarity of speakers is taken into account effectively. We experimentally evaluated two situations to investigate the effectiveness of the proposed methods. In one situation, the amount of data from each speaker is balanced (speaker-balanced), and in the other, the data from certain speakers are limited (speaker-imbalanced). Subjective and objective evaluation results showed that both the DGP and DGPLVM synthesize multi-speaker speech more effectively than a DNN in the speaker-balanced situation. We also found that the DGPLVM significantly outperforms the DGP in the speaker-imbalanced situation.
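The abstract mentions that speaker information is supplied to the duration/acoustic models via speaker codes. Below is a minimal sketch, not the authors' implementation, of what such conditioning typically looks like: a one-hot speaker code is appended to every frame of the linguistic input features before they enter a single shared model. The function name, speaker count, and feature dimension are hypothetical and chosen only for illustration.

```python
import numpy as np

NUM_SPEAKERS = 4       # hypothetical number of training speakers
LINGUISTIC_DIM = 300   # hypothetical linguistic feature dimension

def append_speaker_code(linguistic_feats: np.ndarray, speaker_id: int) -> np.ndarray:
    """Concatenate a one-hot speaker code to each frame of linguistic features."""
    one_hot = np.zeros(NUM_SPEAKERS)
    one_hot[speaker_id] = 1.0
    # Tile the code across all frames and concatenate along the feature axis,
    # so one shared duration/acoustic model can serve all speakers.
    codes = np.tile(one_hot, (linguistic_feats.shape[0], 1))
    return np.concatenate([linguistic_feats, codes], axis=1)

# Example: 100 frames for speaker 2 become inputs of size LINGUISTIC_DIM + NUM_SPEAKERS.
x = append_speaker_code(np.random.randn(100, LINGUISTIC_DIM), speaker_id=2)
print(x.shape)  # (100, 304)
```

In the DGPLVM variant described in the abstract, the fixed one-hot code would instead be replaced by a learned latent speaker representation optimized jointly with the model.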