Paper Title
Acoustic-to-articulatory Inversion based on Speech Decomposition and Auxiliary Feature
Paper Authors
Paper Abstract
Acoustic-to-articulatory inversion (AAI) aims to recover the movements of the articulators from speech signals. Achieving speaker-independent AAI remains a challenge given the limited data available. Moreover, most existing works use only audio speech as input, which causes an inevitable performance bottleneck. To address these problems, we first pre-train a speech decomposition network that decomposes audio speech into a speaker embedding and a content embedding, which serve as new personalized speech features to adapt to the speaker-independent case. Second, to further improve AAI, we propose a novel auxiliary feature network that estimates lip auxiliary features from these personalized speech features. Experimental results on three public datasets show that, compared with the state-of-the-art method that uses only audio speech features, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0% in the speaker-dependent case. More importantly, in the speaker-independent case, the average RMSE decreases by 0.29 and the average correlation coefficient increases by 5.0%.
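To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-stage design outlined in the abstract: a pre-trained decomposition network that yields speaker and content embeddings, an auxiliary feature network that predicts lip features from them, and a final regressor that maps everything to articulator trajectories. All names, encoder choices, layer sizes, the 4-dimensional lip feature, and the 12-dimensional EMA output are illustrative assumptions; the abstract does not specify the actual architecture.

```python
# Illustrative sketch only: module names, layer sizes, and dimensions
# (lip_dim=4, ema_dim=12) are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class SpeechDecompositionNet(nn.Module):
    """Pre-trained network that splits an acoustic feature sequence into
    a speaker embedding (one vector per utterance) and a content
    embedding (one vector per frame)."""
    def __init__(self, n_mels=80, spk_dim=128, content_dim=256):
        super().__init__()
        self.speaker_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)

    def forward(self, mel):                    # mel: (B, T, n_mels)
        _, spk_h = self.speaker_enc(mel)       # (1, B, spk_dim)
        spk = spk_h[-1]                        # (B, spk_dim)
        content, _ = self.content_enc(mel)     # (B, T, content_dim)
        return spk, content

class AuxiliaryFeatureNet(nn.Module):
    """Estimates lip auxiliary features from the personalized speech
    features (speaker embedding broadcast over the content frames)."""
    def __init__(self, spk_dim=128, content_dim=256, lip_dim=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(spk_dim + content_dim, 256), nn.ReLU(),
            nn.Linear(256, lip_dim))

    def forward(self, spk, content):
        spk_seq = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.proj(torch.cat([spk_seq, content], dim=-1))

class AAIRegressor(nn.Module):
    """Maps personalized speech features plus lip auxiliary features to
    articulator trajectories (e.g. EMA sensor coordinates)."""
    def __init__(self, spk_dim=128, content_dim=256, lip_dim=4, ema_dim=12):
        super().__init__()
        in_dim = spk_dim + content_dim + lip_dim
        self.rnn = nn.LSTM(in_dim, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, ema_dim)

    def forward(self, spk, content, lip):
        spk_seq = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([spk_seq, content, lip], dim=-1))
        return self.out(h)

# Usage: mel-spectrogram in, articulator trajectory out.
decomp, aux, aai = SpeechDecompositionNet(), AuxiliaryFeatureNet(), AAIRegressor()
mel = torch.randn(2, 100, 80)                  # (batch, frames, mels)
spk, content = decomp(mel)
lip = aux(spk, content)
ema = aai(spk, content, lip)                   # (2, 100, 12)
```

The key design point the abstract emphasizes is that both downstream modules consume the decomposed (speaker + content) representation rather than raw acoustic features, which is what lets the system adapt to unseen speakers.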