Paper Title
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement
Paper Authors
Paper Abstract
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
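
The two-step structure described in the abstract (P-AVSR producing discrete units, P-TTS consuming them) can be illustrated with a toy sketch. This is not the ReVISE implementation: the codebook, the averaging fusion, the nearest-neighbor unit assignment, and the lookup "vocoder" are all hypothetical stand-ins chosen only to show how a discrete-unit interface decouples the two stages.

```python
from typing import List, Sequence

Frame = Sequence[float]

# Hypothetical toy codebook: unit id -> prototype feature vector.
# In the paper the units come from a self-supervised speech model;
# here they are hand-picked points for illustration only.
CODEBOOK: List[List[float]] = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def fuse(audio: Frame, video: Frame) -> List[float]:
    """Toy audio-visual fusion (assumption): per-dimension average."""
    return [(a + v) / 2.0 for a, v in zip(audio, video)]


def p_avsr(audio_frames: Sequence[Frame], video_frames: Sequence[Frame]) -> List[int]:
    """Stand-in for P-AVSR: map each fused frame to its nearest unit id."""
    units = []
    for a, v in zip(audio_frames, video_frames):
        x = fuse(a, v)
        # Squared Euclidean distance to every codebook entry.
        dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in CODEBOOK]
        units.append(dists.index(min(dists)))
    return units


def p_tts(units: Sequence[int], samples_per_unit: int = 2) -> List[float]:
    """Stand-in for P-TTS: emit a fixed number of samples per unit."""
    wave: List[float] = []
    for u in units:
        wave.extend([u / (len(CODEBOOK) - 1)] * samples_per_unit)
    return wave


def resynthesize(audio_frames: Sequence[Frame], video_frames: Sequence[Frame]) -> List[float]:
    """Full pipeline: corrupted audio + video -> discrete units -> waveform."""
    return p_tts(p_avsr(audio_frames, video_frames))


if __name__ == "__main__":
    noisy_audio = [[0.9, 0.1], [0.2, 0.8]]
    video = [[1.0, 0.0], [0.0, 1.0]]
    print(p_avsr(noisy_audio, video))
    print(resynthesize(noisy_audio, video))
```

Because the second stage only sees unit ids, the output is generated from the units rather than reconstructed sample by sample from the input, which matches the abstract's framing that the goal is not to recover the exact reference signal.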