Paper Title
Visual Acoustic Matching
Paper Authors
Paper Abstract
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
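The abstract does not give implementation details, but the audio-visual attention it describes could look roughly like the sketch below: audio frame embeddings act as queries and attend to visual patch embeddings from the target-environment image, so the audio representation is conditioned on the visible geometry and materials. This is a minimal PyTorch-style illustration, not the authors' code; all module names, dimensions, and hyperparameters are hypothetical.

```python
# Minimal sketch (not the paper's implementation): one cross-modal
# transformer block in which audio tokens attend to visual tokens,
# injecting properties of the target room into the audio features.
import torch
import torch.nn as nn


class AudioVisualAttentionBlock(nn.Module):
    """Cross-modal block: audio queries, visual keys/values (hypothetical)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens:  (batch, n_audio_frames, dim), e.g. spectrogram frame embeddings
        # visual_tokens: (batch, n_image_patches, dim), e.g. image patch embeddings
        attended, _ = self.cross_attn(
            query=audio_tokens, key=visual_tokens, value=visual_tokens
        )
        x = self.norm1(audio_tokens + attended)   # residual + norm
        x = self.norm2(x + self.ffn(x))           # feed-forward + residual + norm
        return x  # audio features conditioned on the visible room


if __name__ == "__main__":
    block = AudioVisualAttentionBlock()
    audio = torch.randn(2, 100, 256)   # 2 clips, 100 audio frames
    image = torch.randn(2, 49, 256)    # 2 images, 7x7 patch grid
    print(block(audio, image).shape)   # torch.Size([2, 100, 256])
```

In the full system described by the abstract, blocks like this would sit inside a generator that re-synthesizes the waveform, with the self-supervised objective supplying training pairs from Web videos whose audio already matches their rooms.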