Paper Title

Music2Video: Automatic Generation of Music Video with fusion of audio and text

Authors

Yoonjeon Kim, Joel Jang, Sumin Shin

Abstract

The creation of images with generative adversarial networks has been widely extended to the multi-modal regime with the advent of multi-modal representation models pre-trained on large corpora. Modalities that share a common representation space can be used to guide generative models to create images from text or even from audio sources. Departing from previous methods that rely solely on either text or audio, we exploit the expressiveness of both modalities. Based on the fusion of text and audio, we create videos whose content is consistent with the distinct modalities provided. Our method includes a simple approach that automatically segments the video into variable-length intervals and maintains temporal consistency in the generated video. Our proposed framework for generating music videos shows promising results at the application level, where users can interactively feed in a music source and a text source to create artistic music videos. Our code is available at https://github.com/joeljang/music2video.
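The abstract names two technical ingredients: fusing text and audio representations that live in a shared embedding space, and segmenting the music into variable-length intervals. The sketch below is only an illustration of those two ideas, not the authors' pipeline; it assumes the publicly available librosa, torch, clip (OpenAI), and wav2clip packages, and the function names, the beats_per_segment granularity, and the alpha fusion weight are hypothetical choices. Beat-aligned grouping stands in for whatever segmentation heuristic the paper actually uses.

```python
import librosa
import torch
import clip       # OpenAI CLIP: https://github.com/openai/CLIP
import wav2clip   # Wav2CLIP: maps audio into CLIP's embedding space

def segment_music(path, beats_per_segment=8):
    """Split a track into variable-length intervals aligned to detected beats."""
    # 16 kHz mono is an assumed input format here.
    y, sr = librosa.load(path, sr=16000, mono=True)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    # Group every `beats_per_segment` beats into one video segment, so
    # interval lengths vary with the music's tempo.
    bounds = beat_times[::beats_per_segment]
    return y, sr, list(zip(bounds[:-1], bounds[1:]))

def fused_guidance(text, audio_chunk, alpha=0.5, device="cpu"):
    """Blend a CLIP text embedding with a Wav2CLIP audio embedding.

    `audio_chunk` is a mono float numpy array for one segment; `alpha`
    weights text vs. audio in the fused guidance vector.
    """
    clip_model, _ = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        tokens = clip.tokenize([text]).to(device)
        text_emb = clip_model.encode_text(tokens).float()
    audio_model = wav2clip.get_model()
    audio_emb = torch.from_numpy(wav2clip.embed_audio(audio_chunk, audio_model)).float()
    # Normalize each modality, then interpolate between them.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)
    target = alpha * text_emb + (1.0 - alpha) * audio_emb
    return target / target.norm(dim=-1, keepdim=True)
```

The fused target vector could then steer a CLIP-guided image generator once per segment; one plausible way to keep the temporal consistency the abstract mentions is to initialize each segment's generation from the final frame of the previous segment.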
