Paper Title
DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation
Paper Authors
Paper Abstract
Conversation is an essential component of virtual avatar activities in the metaverse. With the development of natural language processing, textual and vocal conversation generation has achieved significant breakthroughs. However, face-to-face conversations account for the vast majority of daily conversations, while most existing methods focus on single-person talking head generation. In this work, we take a step further and consider generating realistic face-to-face conversation videos. Conversation generation is more challenging than single-person talking head generation, since it not only requires generating photo-realistic individual talking heads but also requires the listener to respond to the speaker. In this paper, we propose a novel unified framework based on neural radiance fields (NeRF) to address this task. Specifically, we model both the speaker and the listener with a NeRF framework, with different conditions to control individual expressions. The speaker is driven by the audio signal, while the response of the listener depends on both visual and acoustic information. In this way, face-to-face conversation videos are generated between human avatars, with all the interlocutors modeled within the same network. Moreover, to facilitate future research on this task, we collect a new human conversation dataset containing 34 video clips. Quantitative and qualitative experiments evaluate our method in different aspects, e.g., image quality, pose sequence trend, and naturalness of the rendered videos. Experimental results demonstrate that the avatars in the resulting videos are able to carry out a realistic conversation and maintain their individual styles. All the code, data, and models will be made publicly available.
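The abstract describes a single conditional NeRF that renders both interlocutors, with the conditioning signal switching by role: audio features alone for the speaker, fused audio and visual features for the listener. The sketch below is a minimal illustration of that conditioning pattern, not the authors' released implementation; all module names, feature dimensions, and the zero-padded speaker condition are illustrative assumptions.

```python
# Minimal sketch of a condition-driven NeRF serving both interlocutors.
# Assumptions (not from the paper): a shared MLP maps a positionally
# encoded 3D point plus a per-frame condition vector to density and RGB;
# the speaker's condition is audio-only, the listener's fuses audio and
# visual features. Names and dimensions are hypothetical.
import torch
import torch.nn as nn


def positional_encoding(x: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Standard NeRF-style sinusoidal encoding of 3D sample points."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device)
    angles = x[..., None] * freqs                    # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)                 # (..., 3 * 2 * num_freqs)


class ConditionalNeRF(nn.Module):
    """One radiance field shared by both avatars; the per-frame
    condition vector controls the rendered expression."""

    def __init__(self, cond_dim: int = 128, hidden: int = 256, num_freqs: int = 10):
        super().__init__()
        in_dim = 3 * 2 * num_freqs + cond_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)     # volume density sigma
        self.rgb_head = nn.Linear(hidden, 3)         # emitted color

    def forward(self, pts: torch.Tensor, cond: torch.Tensor):
        # pts: (N, 3) ray sample points; cond: (N, cond_dim) condition.
        h = self.mlp(torch.cat([positional_encoding(pts), cond], dim=-1))
        sigma = torch.relu(self.density_head(h))
        rgb = torch.sigmoid(self.rgb_head(h))
        return sigma, rgb


# Hypothetical usage: the speaker is driven by audio features alone,
# while the listener's response is conditioned on both the audio and
# visual features extracted from the speaker.
audio_feat = torch.randn(1024, 64)                   # per-frame audio features
visual_feat = torch.randn(1024, 64)                  # per-frame visual features
speaker_cond = torch.cat([audio_feat, torch.zeros_like(visual_feat)], dim=-1)
listener_cond = torch.cat([audio_feat, visual_feat], dim=-1)

model = ConditionalNeRF(cond_dim=128)
pts = torch.rand(1024, 3)
sigma_s, rgb_s = model(pts, speaker_cond)            # render the speaker
sigma_l, rgb_l = model(pts, listener_cond)           # render the listener
```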