Paper title
Controllable neural text-to-speech synthesis using intuitive prosodic features
Paper authors
Paper abstract
Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database instead of having wide prosodic variation. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence. In this work, we train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles, while maintaining a mean opinion score (4.23) similar to that of our Tacotron baseline (4.26).
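To make the five sentence-wise conditioning features more concrete, the following is a minimal sketch of how such a feature vector might be computed from frame-level measurements. The abstract does not give the feature definitions, so everything here is an assumption for illustration: the function names `linear_slope` and `sentence_prosody_features`, the use of log-F0 over voiced frames, max-minus-min as the pitch range, and a least-squares slope of a dB spectrum as the spectral tilt are hypothetical choices, not the authors' method.

```python
import math
from statistics import mean


def linear_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def sentence_prosody_features(f0_hz, frame_energy, phone_durations_s,
                              freqs_hz, spectrum_db):
    """Return five sentence-level prosodic features as a dict.

    All definitions below are illustrative assumptions, not the paper's.
    """
    # Assumption: unvoiced frames are marked with f0 == 0 and are skipped.
    log_f0 = [math.log(f) for f in f0_hz if f > 0]
    log_energy = [math.log(e) for e in frame_energy]
    return {
        "pitch": mean(log_f0),                         # mean log-F0
        "pitch_range": max(log_f0) - min(log_f0),      # log-F0 span
        "phone_duration": mean(phone_durations_s),     # seconds per phone
        "energy": mean(log_energy),                    # mean log energy
        "spectral_tilt": linear_slope(freqs_hz, spectrum_db),  # dB per Hz
    }


# Toy input values, invented purely to exercise the function.
features = sentence_prosody_features(
    f0_hz=[0.0, 100.0, 200.0, 0.0],
    frame_energy=[1.0, 1.0, 1.0, 1.0],
    phone_durations_s=[0.05, 0.15],
    freqs_hz=[0.0, 1000.0, 2000.0],
    spectrum_db=[0.0, -6.0, -12.0],
)
print(features)
```

In a setup like the one the abstract describes, a vector of this kind would be supplied to the sequence-to-sequence model as conditioning input during training, and at synthesis time one value at a time could be changed to vary the corresponding prosodic dimension. How the paper normalizes the features and injects them into the network is not stated in the abstract.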