Paper Title
World-Consistent Video-to-Video Synthesis
Paper Authors
Paper Abstract
Video-to-video synthesis (vid2vid) aims to convert high-level semantic inputs into photorealistic videos. While existing vid2vid methods can achieve short-term temporal consistency, they fail to ensure long-term consistency. This is because they lack knowledge of the 3D world being rendered and generate each frame based only on the past few frames. To address this limitation, we introduce a novel vid2vid framework that efficiently and effectively utilizes all past generated frames during rendering. This is achieved by condensing the 3D world rendered so far into a physically-grounded estimate of the current frame, which we call the guidance image. We further propose a novel neural network architecture to take advantage of the information stored in the guidance images. Extensive experimental results on several challenging datasets verify the effectiveness of our approach in achieving world consistency: the output video is consistent within the entire rendered 3D world. Project page: https://nvlabs.github.io/wc-vid2vid/
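To make the guidance-image idea concrete, the sketch below shows one simple way such an image could be formed: colors already committed by past generated frames are attached to 3D surface points and re-projected into the current camera to yield a partial estimate of the current frame plus a coverage mask, which a vid2vid generator could then consume alongside the semantic input. This is a minimal illustrative sketch, not the authors' implementation; the function and variable names (e.g. build_guidance_image, world_points) are hypothetical, and the splatting omits the z-buffering a real renderer would need.

```python
# Illustrative sketch of constructing a "guidance image" by re-projecting
# previously generated colors into the current camera view. Names and the
# simplified splatting are assumptions, not the paper's actual code.
import numpy as np

def build_guidance_image(world_points, point_colors, K, cam_pose, height, width):
    """Project previously textured 3D points into the current view.

    world_points: (N, 3) 3D points whose colors were fixed by past generated frames.
    point_colors: (N, 3) RGB colors assigned to those points so far.
    K:            (3, 3) camera intrinsics.
    cam_pose:     (4, 4) world-to-camera extrinsics for the current frame.
    Returns a guidance image (H, W, 3) and a boolean mask (H, W) of covered pixels.
    """
    # Transform the points into the current camera frame.
    pts_h = np.concatenate([world_points, np.ones((len(world_points), 1))], axis=1)
    cam_pts = (cam_pose @ pts_h.T).T[:, :3]

    # Keep points in front of the camera and project with the pinhole model.
    in_front = cam_pts[:, 2] > 1e-6
    cam_pts, colors = cam_pts[in_front], point_colors[in_front]
    proj = (K @ cam_pts.T).T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)

    # Splat colors onto the image plane; a real system would also z-buffer.
    guidance = np.zeros((height, width, 3), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    u, v = uv[valid, 0], uv[valid, 1]
    guidance[v, u] = colors[valid]
    mask[v, u] = True
    return guidance, mask
```

Pixels left uncovered by the mask correspond to newly revealed parts of the scene; a generator conditioned on both the semantic input and this partial guidance image can copy colors where the world has already been rendered and hallucinate content only where it has not, which is what keeps the output consistent across long camera trajectories.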