Paper Title
DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering
Paper Authors
Paper Abstract
While recent advances in deep neural networks have made it possible to render high-quality images, generating photo-realistic and personalized talking heads remains challenging. Given an audio input, the key to tackling this task is synchronizing lip movements while simultaneously generating personalized attributes such as head movement and eye blinking. In this work, we observe that the input audio is highly correlated with lip motion but less correlated with other personalized attributes (e.g., head movements). Inspired by this, we propose a novel framework based on neural radiance fields to pursue high-fidelity and personalized talking head generation. Specifically, the neural radiance field takes lip-movement features and personalized attributes as two disentangled conditions, where lip movements are predicted directly from the audio input to achieve lip-synchronized generation. Meanwhile, the personalized attributes are sampled from a probabilistic model: we design a Transformer-based variational autoencoder whose latent space is sampled from a Gaussian process to learn plausible and natural-looking head poses and eye blinks. Experiments on several benchmarks demonstrate that our method achieves significantly better results than state-of-the-art methods.
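To make the disentangled conditioning described in the abstract concrete, here is a minimal PyTorch sketch of a NeRF-style MLP that takes a lip-motion feature (predicted from audio) and a personalized-attribute code (head pose / eye blink sampled from a probabilistic prior) as two separate conditions. This is an illustrative sketch only, not the authors' implementation; all module names, dimensions, and the positional-encoding sizes are assumptions.

```python
# Illustrative sketch only -- not the DFA-NeRF authors' code.
# Dimensions and module names are assumptions for clarity.
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    """NeRF-style MLP conditioned on two disentangled codes:
    a lip-motion feature (driven by audio) and a personalized-attribute
    code (head pose / eye blink drawn from a probabilistic model)."""
    def __init__(self, pos_dim=63, dir_dim=27, lip_dim=64, attr_dim=32, hidden=256):
        super().__init__()
        # The density branch sees the encoded 3D position plus both condition codes.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + lip_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)  # volume density
        # The color branch additionally sees the encoded viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, pos_enc, dir_enc, lip_feat, attr_code):
        h = self.trunk(torch.cat([pos_enc, lip_feat, attr_code], dim=-1))
        sigma = self.sigma_head(h)
        rgb = self.color_head(torch.cat([h, dir_enc], dim=-1))
        return rgb, sigma

# Example query for a batch of points sampled along camera rays.
model = ConditionalNeRF()
pos_enc  = torch.randn(1024, 63)   # positionally encoded 3D sample points
dir_enc  = torch.randn(1024, 27)   # positionally encoded view directions
lip_feat = torch.randn(1024, 64)   # per-frame lip-motion feature from audio
attr     = torch.randn(1024, 32)   # personalized-attribute code (pose / blink)
rgb, sigma = model(pos_enc, dir_enc, lip_feat, attr)
```

Keeping the two condition vectors separate at the input reflects the paper's central observation: the audio-driven lip code and the sampled head-pose/eye-blink code can then be varied independently at render time.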