Paper title
Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech
Authors
Abstract
Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of a perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied regardless of the TTS model architecture or the cause of speech quality degradation, and it does not increase inference time or model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves on previous models in terms of both naturalness and intelligibility.
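The perceptual loss described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the loss measures a distance between the maximum possible score and the predicted one, so the absolute-difference form, the 1-5 MOS scale (`MOS_MAX = 5.0`), the `weight` parameter, and the function names below are all assumptions made for illustration. In actual training the predicted scores would come from the frozen, pre-trained MOS prediction model applied to synthesized speech, with gradients flowing back into the TTS model only.

```python
from typing import Sequence

# Assumption: MOS is rated on the common 1-5 scale, so 5.0 is the maximum score.
MOS_MAX = 5.0


def perceptual_loss(predicted_mos: Sequence[float], mos_max: float = MOS_MAX) -> float:
    """Mean absolute distance between the maximum score and each predicted score.

    The absolute difference is an assumed choice of distance; the abstract
    does not specify which distance is used.
    """
    if not predicted_mos:
        raise ValueError("need at least one predicted score")
    return sum(abs(mos_max - score) for score in predicted_mos) / len(predicted_mos)


def total_loss(tts_loss: float, predicted_mos: Sequence[float], weight: float = 1.0) -> float:
    """Combine the usual TTS training loss with the perceptual term.

    `weight` is a hypothetical balancing coefficient, not taken from the paper.
    """
    return tts_loss + weight * perceptual_loss(predicted_mos)


if __name__ == "__main__":
    # Hypothetical predicted scores for a batch of two synthesized utterances.
    batch_scores = [3.0, 4.0]
    print(perceptual_loss(batch_scores))
    print(total_loss(2.0, batch_scores, weight=0.5))
```

Because the perceptual term is only used during training, the synthesized-speech path at inference time is unchanged, which is consistent with the abstract's claim that inference time and model complexity do not increase.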