LISA: Localized Image Stylization with Audio via Implicit Neural Representation

Paper Title

LISA: Localized Image Stylization with Audio via Implicit Neural Representation

Authors

Seung Hyun Lee, Chanyoung Kim, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim

Abstract

We present a novel framework, Localized Image Stylization with Audio (LISA), which performs audio-driven localized image stylization. Sound often provides information about the specific context of a scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a particular part of the image based on audio input is natural but challenging. In this work, we propose a framework in which a user provides one audio input to localize the sound source in the input image and another to locally stylize the target object or scene. LISA first produces a delicate localization map with an audio-visual localization network by leveraging the CLIP embedding space. We then utilize an implicit neural representation (INR) along with the predicted localization map to stylize the target object or scene based on sound information. The proposed INR can manipulate the localized pixel values to be semantically consistent with the provided audio input. Through a series of experiments, we show that the proposed framework outperforms other audio-guided stylization methods. Moreover, LISA constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
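
The abstract describes restricting stylization to the region given by a localization map. A minimal sketch of that gating idea follows; it is an illustration under stated assumptions, not the authors' implementation. The names `make_coords`, `tiny_inr`, and `stylize_locally`, the two-layer sinusoidal MLP, the `strength` parameter, and the additive blending rule are all hypothetical. The weights are random, and the audio-visual localization network and the CLIP-based optimization are omitted.

```python
import numpy as np

def make_coords(h, w):
    """Return an (h*w, 2) array of pixel coordinates normalized to [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)

def tiny_inr(coords, params):
    """Hypothetical two-layer coordinate MLP: (x, y) -> RGB offset in [-1, 1]."""
    w1, b1, w2, b2 = params
    hidden = np.sin(coords @ w1 + b1)  # sinusoidal activation (an assumption)
    return np.tanh(hidden @ w2 + b2)

def stylize_locally(image, loc_map, params, strength=0.5):
    """Add INR-predicted offsets only where the localization map is active."""
    h, w, _ = image.shape
    offsets = tiny_inr(make_coords(h, w), params).reshape(h, w, 3)
    mask = np.clip(loc_map, 0.0, 1.0)[..., None]  # (h, w, 1), broadcasts over RGB
    return np.clip(image + strength * mask * offsets, 0.0, 1.0)

# Toy inputs: random untrained weights, a random image, and a hand-made map
# standing in for the output of a localization network.
rng = np.random.default_rng(0)
params = (rng.normal(size=(2, 16)), rng.normal(size=16),
          rng.normal(size=(16, 3)), rng.normal(size=3))
image = rng.random((8, 8, 3))
loc_map = np.zeros((8, 8))
loc_map[2:5, 2:5] = 1.0  # pretend the sound source was localized here
out = stylize_locally(image, loc_map, params)
```

In this sketch, pixels where the map is zero pass through unchanged, so only the localized region is altered. In a trained system the INR weights would be optimized so that the masked region matches the audio semantics.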

