Paper Title

Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings

Authors

Tenglong Ao, Qingzhe Gao, Yuke Lou, Baoquan Chen, Libin Liu

Abstract

Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in creating artificial embodied agents. Previous systems mainly generate gestures in an end-to-end manner, which makes it difficult to extract clear rhythm and semantics from the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results in both rhythm and semantics. For rhythm, our system contains a robust rhythm-based segmentation pipeline that explicitly ensures temporal coherence between vocalization and gestures. For gesture semantics, we devise a mechanism, grounded in linguistic theory, that effectively disentangles low- and high-level neural embeddings of speech and motion. The high-level embedding corresponds to semantics, while the low-level embedding captures subtle variations. Finally, we build a correspondence between the hierarchical embeddings of the speech and those of the motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.
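To make the idea of hierarchical (low- and high-level) embeddings concrete, below is a minimal PyTorch sketch of an encoder that splits a speech or motion feature sequence into a coarse, strongly downsampled code (a stand-in for the semantic level) and a frame-rate code (a stand-in for the subtle-variation level). This is an illustrative assumption, not the paper's actual architecture: the class name `HierarchicalEncoder`, the layer choices, and all dimensions are hypothetical.

```python
# Minimal sketch (assumption, not the authors' model): a two-branch encoder
# that separates a coarse "semantic-like" code from a fine-grained
# "variation-like" code for one beat-aligned feature segment.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, in_dim=128, high_dim=32, low_dim=64):
        super().__init__()
        # High-level branch: aggressive temporal downsampling followed by
        # pooling, so the code can only retain slowly varying information.
        self.high = nn.Sequential(
            nn.Conv1d(in_dim, high_dim, kernel_size=8, stride=8),
            nn.GELU(),
            nn.AdaptiveAvgPool1d(1),  # one code vector per segment
        )
        # Low-level branch: a frame-rate recurrent code for subtle variations.
        self.low = nn.GRU(in_dim, low_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, time, in_dim) feature sequence (e.g. audio or motion)
        h = self.high(x.transpose(1, 2)).squeeze(-1)  # (batch, high_dim)
        l, _ = self.low(x)                            # (batch, time, low_dim)
        return h, l

# Usage on random features standing in for two beat-aligned segments.
enc = HierarchicalEncoder()
x = torch.randn(2, 64, 128)             # 2 segments, 64 frames, 128-dim features
high_code, low_code = enc(x)
print(high_code.shape, low_code.shape)  # (2, 32) and (2, 64, 64)
```

The design point this sketch illustrates is the timescale split the abstract describes: by forcing one branch through heavy downsampling and pooling, it can only carry segment-level (semantic-like) content, while the per-frame branch is free to model fine-grained variation.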
