Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution

Paper Title

Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution

Authors

Mingjie Chen, Yanghao Zhou, Heyan Huang, Thomas Hain

Abstract

It was recently shown that a combination of ASR and TTS models yields highly competitive performance on standard voice conversion tasks such as the Voice Conversion Challenge 2020 (VCC2020). To obtain good performance, both models require pretraining on large amounts of data, which results in large models that are potentially inefficient in use. In this work, we present a model that is significantly smaller, and thereby faster in processing, while obtaining equivalent performance. To achieve this, the proposed model, Dynamic-GAN-VC (DYGAN-VC), uses a non-autoregressive structure and makes use of vector-quantised embeddings obtained from a VQWav2vec model. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. Objective and subjective evaluations were performed on the VCC2020 task, yielding MOS scores of up to 3.86 and character error rates as low as 4.3%. This was achieved with approximately half the number of model parameters and up to 8 times faster decoding speed.
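To make the parameter-efficiency claim about dynamic convolution more concrete, the following is a minimal pure-Python sketch of one common formulation: the kernel for each frame is predicted from that frame, normalised with a softmax, and shared across all channels. The abstract does not specify the exact layer used in DYGAN-VC, so this is an assumption for illustration only; the names `dynamic_conv`, `softmax`, and `proj` are hypothetical and not taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv(frames, proj, kernel_size):
    """Apply a single-head dynamic convolution over a sequence.

    frames: list of T frames, each a list of C floats.
    proj:   hypothetical kernel predictor, a kernel_size x C matrix.
            It holds kernel_size * C weights, whereas a standard 1D
            convolution with C input and C output channels would hold
            kernel_size * C * C weights.
    """
    assert kernel_size % 2 == 1, "an odd kernel keeps the window centred"
    half = kernel_size // 2
    num_frames = len(frames)
    channels = len(frames[0])
    output = []
    for t, frame in enumerate(frames):
        # Predict this time step's kernel from the current frame only.
        logits = [sum(w * x for w, x in zip(row, frame)) for row in proj]
        kernel = softmax(logits)
        # Mix neighbouring frames with the predicted weights,
        # treating positions outside the sequence as zeros.
        mixed = [0.0] * channels
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_frames:
                for c in range(channels):
                    mixed[c] += weight * frames[src][c]
        output.append(mixed)
    return output


if __name__ == "__main__":
    frames = [[1.0, 2.0] for _ in range(4)]
    proj = [[0.0, 0.0] for _ in range(3)]  # all-zero predictor
    for row in dynamic_conv(frames, proj, 3):
        print(row)
```

With an all-zero predictor the softmax yields a uniform kernel, so the layer reduces to a moving average over the window. A trained predictor would instead produce a different kernel at every time step, which is what lets the layer adapt to the content while keeping the weight count linear in the number of channels.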
