Paper title
Audio Data Augmentation for Acoustic-to-articulatory Speech Inversion using Bidirectional Gated RNNs
Paper authors
Paper abstract
Data augmentation has proven to be a promising way to improve the performance of deep learning models by adding variability to the training data. In previous work on developing a noise-robust acoustic-to-articulatory speech inversion system, we showed the importance of noise augmentation for improving speech inversion performance on noisy speech. In this work, we compare and contrast different ways of doing data augmentation and show how this technique improves the performance of articulatory speech inversion not only on noisy speech but also on clean speech data. We also propose a Bidirectional Gated Recurrent Neural Network as the speech inversion system in place of the previously used feedforward neural network. The inversion system uses mel-frequency cepstral coefficients (MFCCs) as the input acoustic features and six vocal tract variables (TVs) as the output articulatory features. The performance of the system was measured by computing the correlation between estimated and actual TVs on the U. Wisc. X-ray Microbeam database. The proposed speech inversion system shows a 5% relative improvement in correlation over the baseline noise-robust system on clean speech data. When the pre-trained model is adapted to each unseen speaker in the test set, the average correlation improves by a further 6%.
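As a rough illustration of two ideas mentioned in the abstract, the sketch below mixes noise into a waveform at a chosen signal-to-noise ratio and scores estimated TVs against reference TVs with a per-variable correlation. The abstract does not give the noise types, SNR levels, or the exact correlation formula, so the function names `add_noise_at_snr` and `per_tv_correlation`, the use of the Pearson correlation, and the `(frames, n_tvs)` array layout are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` so the mixture has the requested SNR in dB.

    Hypothetical helper: the paper's actual augmentation recipe is not
    described in the abstract.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile, then trim, the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so speech_power / scaled_noise_power = 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


def per_tv_correlation(estimated, actual):
    """Pearson correlation for each TV; inputs have shape (frames, n_tvs).

    Assumption: the abstract only says "correlation", so Pearson is used here.
    """
    est = np.asarray(estimated, dtype=float)
    act = np.asarray(actual, dtype=float)
    # Centre each TV trajectory before correlating.
    est_c = est - est.mean(axis=0)
    act_c = act - act.mean(axis=0)
    num = (est_c * act_c).sum(axis=0)
    den = np.sqrt((est_c ** 2).sum(axis=0) * (act_c ** 2).sum(axis=0))
    return num / den
```

Averaging the array returned by `per_tv_correlation` over its six entries would give a single score comparable in spirit to the average correlation reported above, under the same assumptions.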