Paper Title
Acoustic-to-articulatory Inversion based on Speech Decomposition and Auxiliary Feature
Paper Authors
Paper Abstract
Acoustic-to-articulatory inversion (AAI) aims to recover the movements of the articulators from speech signals. Achieving speaker-independent AAI remains a challenge given the limited data available. Moreover, most existing works use only audio speech as input, which causes an inevitable performance bottleneck. To address these problems, we first pre-train a speech decomposition network that decomposes audio speech into a speaker embedding and a content embedding, which serve as new personalized speech features to adapt to the speaker-independent case. Second, to further improve AAI, we propose a novel auxiliary feature network that estimates lip auxiliary features from these personalized speech features. Experimental results on three public datasets show that, compared with the state-of-the-art method that uses only audio speech features, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0% in the speaker-dependent case. More importantly, in the speaker-independent case, the average RMSE decreases by 0.29 and the average correlation coefficient increases by 5.0%.
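To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-stage design outlined in the abstract: a pre-trained decomposition network that yields speaker and content embeddings, an auxiliary feature network that predicts lip features from them, and a final regressor that maps everything to articulator trajectories. All names, encoder choices, layer sizes, the 4-dimensional lip feature, and the 12-dimensional EMA output are illustrative assumptions; the abstract does not specify the actual architecture.

```python
# Illustrative sketch only: module names, layer sizes, and dimensions
# (lip_dim=4, ema_dim=12) are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class SpeechDecompositionNet(nn.Module):
    """Pre-trained network that splits an acoustic feature sequence into
    a speaker embedding (one vector per utterance) and a content
    embedding (one vector per frame)."""
    def __init__(self, n_mels=80, spk_dim=128, content_dim=256):
        super().__init__()
        self.speaker_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)

    def forward(self, mel):                    # mel: (B, T, n_mels)
        _, spk_h = self.speaker_enc(mel)       # (1, B, spk_dim)
        spk = spk_h[-1]                        # (B, spk_dim)
        content, _ = self.content_enc(mel)     # (B, T, content_dim)
        return spk, content

class AuxiliaryFeatureNet(nn.Module):
    """Estimates lip auxiliary features from the personalized speech
    features (speaker embedding broadcast over the content frames)."""
    def __init__(self, spk_dim=128, content_dim=256, lip_dim=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(spk_dim + content_dim, 256), nn.ReLU(),
            nn.Linear(256, lip_dim))

    def forward(self, spk, content):
        spk_seq = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.proj(torch.cat([spk_seq, content], dim=-1))

class AAIRegressor(nn.Module):
    """Maps personalized speech features plus lip auxiliary features to
    articulator trajectories (e.g. EMA sensor coordinates)."""
    def __init__(self, spk_dim=128, content_dim=256, lip_dim=4, ema_dim=12):
        super().__init__()
        in_dim = spk_dim + content_dim + lip_dim
        self.rnn = nn.LSTM(in_dim, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, ema_dim)

    def forward(self, spk, content, lip):
        spk_seq = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([spk_seq, content, lip], dim=-1))
        return self.out(h)

# Usage: mel-spectrogram in, articulator trajectory out.
decomp, aux, aai = SpeechDecompositionNet(), AuxiliaryFeatureNet(), AAIRegressor()
mel = torch.randn(2, 100, 80)                  # (batch, frames, mels)
spk, content = decomp(mel)
lip = aux(spk, content)
ema = aai(spk, content, lip)                   # (2, 100, 12)
```

The key design point the abstract emphasizes is that both downstream modules consume the decomposed (speaker + content) representation rather than raw acoustic features, which is what lets the system adapt to unseen speakers.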