Paper Title

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

Authors

Xiaohong Li, Xiang Wang, Kai Wang, Shiguo Lian

Abstract

Generating natural lip movement synchronized with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a deep neural network combining one-dimensional convolutions and LSTM to generate vertex displacements of a 3D template face model from variable-length speech input. The motion of the lower part of the face, represented by the vertex movement of 3D lip shapes, is consistent with the input speech. To enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features, and we adopt a velocity loss term to reduce the jitter of the generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of such public data. Qualitative and quantitative evaluations indicate that our model is able to generate smooth, natural lip movements synchronized with speech.
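The velocity loss term mentioned in the abstract penalizes mismatches in frame-to-frame vertex motion rather than only in static positions, which is what suppresses jitter. A minimal NumPy sketch, assuming vertex sequences of shape (T, V, 3) and an L2 formulation (the paper's exact loss definition is not reproduced here, so the function name and formulation are illustrative assumptions):

```python
import numpy as np

def velocity_loss(pred, target):
    """Hypothetical velocity loss: L2 distance between frame-to-frame
    vertex displacements of predicted and ground-truth animation.

    pred, target: arrays of shape (T, V, 3) -- T frames, V vertices, xyz.
    """
    # Per-frame velocity via finite differences along the time axis.
    pred_vel = pred[1:] - pred[:-1]
    target_vel = target[1:] - target[:-1]
    # Mean squared error between the two velocity fields.
    return float(np.mean((pred_vel - target_vel) ** 2))
```

In training, a term like this would typically be added to a per-frame position loss with a weighting coefficient, trading off lip-shape accuracy against temporal smoothness.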
