Paper Title

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

Paper Authors

Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao

Paper Abstract

Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, these works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch, but it lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model's capacity to learn rich contextual representations. Experiment results demonstrate that our proposed Mixed-Phoneme BERT significantly improves TTS performance, with a 0.30 CMOS gain compared with the FastSpeech 2 baseline. Mixed-Phoneme BERT achieves a 3x inference speedup and similar voice quality to the previous TTS pre-trained model PnG BERT.
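The core idea in the abstract, merging adjacent phonemes into sup-phonemes and feeding both sequences to the model, can be illustrated with a small sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: the toy merge table stands in for a BPE-style sup-phoneme tokenizer, and the vocabularies, embedding size, and the expand-and-sum way of combining the two streams are choices made here for illustration only.

```python
# Minimal sketch: build a mixed phoneme / sup-phoneme input for an encoder.
# Everything below (merge table, vocab, embedding size, expand-and-sum) is
# a hypothetical setup for illustration, not the paper's actual tokenizer.
import torch
import torch.nn as nn

PHONEMES = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]          # e.g. "hello world"
PHONE_VOCAB = {p: i for i, p in enumerate(sorted(set(PHONEMES)))}

# Toy merge table: adjacent phoneme pairs treated as one sup-phoneme unit.
SUP_MERGES = {("HH", "AH"): "HH_AH", ("ER", "L"): "ER_L"}

def to_sup_phonemes(phones):
    """Greedily merge adjacent phonemes into sup-phoneme units.
    Returns the sup-phoneme sequence and how many phonemes each unit spans."""
    units, spans, i = [], [], 0
    while i < len(phones):
        if i + 1 < len(phones) and (phones[i], phones[i + 1]) in SUP_MERGES:
            units.append(SUP_MERGES[(phones[i], phones[i + 1])])
            spans.append(2)
            i += 2
        else:
            units.append(phones[i])
            spans.append(1)
            i += 1
    return units, spans

sup_units, spans = to_sup_phonemes(PHONEMES)
SUP_VOCAB = {u: i for i, u in enumerate(sorted(set(sup_units)))}

phone_emb = nn.Embedding(len(PHONE_VOCAB), 16)
sup_emb = nn.Embedding(len(SUP_VOCAB), 16)

phone_ids = torch.tensor([PHONE_VOCAB[p] for p in PHONEMES])
sup_ids = torch.tensor([SUP_VOCAB[u] for u in sup_units])

# Expand each sup-phoneme embedding over the phonemes it covers, then sum the
# two streams so the combined input stays at phoneme resolution.
expanded_sup = torch.repeat_interleave(sup_emb(sup_ids), torch.tensor(spans), dim=0)
mixed_input = phone_emb(phone_ids) + expanded_sup     # shape: [num_phonemes, 16]
print(mixed_input.shape)                               # torch.Size([8, 16])
```

Keeping the combined representation at phoneme resolution is one plausible way to let a downstream TTS model consume it exactly as it would a plain phoneme sequence, while still exposing the coarser sup-phoneme context.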
