AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Paper title

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Authors

Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li

Abstract

In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers. Their auxiliary attributes, such as gender, age group and native accent, are explicitly marked and provided in the corpus. Transcripts at both the Chinese character level and the pinyin level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension of Tacotron-2 in which a speaker verification model and a corresponding voice similarity loss are incorporated as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well to speakers that are never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker embedding similarity and equal error rate measurement. The dataset, baseline system code and generated samples are available online.
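
The two objective measures named in the abstract, speaker embedding similarity and equal error rate (EER), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `cosine_similarity` and `equal_error_rate` are hypothetical, cosine similarity is assumed as the embedding similarity measure, and the score lists are toy values standing in for scores that a speaker verification model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    FRR is the fraction of same-speaker (target) trials rejected;
    FAR is the fraction of different-speaker (non-target) trials accepted.
    The EER is taken at the threshold where FAR and FRR are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration): higher means "more likely same speaker".
same_speaker = [0.9, 0.4]
different_speaker = [0.1, 0.6]
eer = equal_error_rate(same_speaker, different_speaker)
```

In this reading, a higher cosine similarity between the embedding of a synthesized utterance and that of the reference speaker, and a lower EER when synthesized utterances are scored by a verification model, both indicate closer voice similarity.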
