Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

Paper Title

Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

Authors

Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh

Abstract

Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames' quality and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model trained on 16-frame video clips from standard benchmarks such as UCF-101, Sky Time-lapse, and Taichi-HD datasets can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio. Videos and code can be found at https://songweige.github.io/projects/tats/index.html.
