Paper Title

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

Authors

Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka

Abstract

Personalized speech enhancement (PSE) models achieve promising results compared with unconditional speech enhancement models due to their ability to remove interfering speech in addition to background noise. Unlike unconditional speech enhancement, causal PSE models may occasionally remove the target speech by mistake. The PSE models also tend to leak interfering speech when the target speaker is silent for an extended period. We show that existing PSE methods suffer from a trade-off between speech over-suppression and interference leakage by addressing one problem at the expense of the other. We propose a new PSE model training framework using cross-task knowledge distillation to mitigate this trade-off. Specifically, we utilize a personalized voice activity detector (pVAD) during training to exclude the non-target speech frames that are wrongly identified as containing the target speaker with hard or soft classification. This prevents the PSE model from being too aggressive while still allowing the model to learn to suppress the input speech when it is likely to be spoken by interfering speakers. Comprehensive evaluation results are presented, covering various PSE usage scenarios.
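The pVAD-guided frame handling described above can be illustrated as a per-frame loss weighting: frames the pVAD judges as non-target contribute little (soft classification) or nothing (hard classification) to the enhancement loss, so the model is not penalized into over-suppression. This is a minimal sketch of that idea only; the function name, the exact weighting scheme, and the threshold are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def pvad_weighted_loss(enhanced, clean, pvad_probs, soft=True, threshold=0.5):
    """Frame-wise L2 loss between enhanced and clean spectra, gated by
    pVAD target-speaker probabilities (illustrative sketch, not the
    paper's exact loss).

    enhanced, clean: arrays of shape (T, F) - T frames, F features.
    pvad_probs: shape (T,), probability each frame contains the target.
    """
    frame_err = np.mean((enhanced - clean) ** 2, axis=-1)    # (T,) per-frame error
    if soft:
        weights = pvad_probs                                  # soft classification
    else:
        weights = (pvad_probs >= threshold).astype(float)     # hard classification
    # Frames deemed non-target are down-weighted or excluded entirely.
    return float(np.sum(weights * frame_err) / (np.sum(weights) + 1e-8))
```

In this reading, a frame the pVAD labels non-target can be suppressed by the PSE model without incurring loss, which targets the over-suppression side of the trade-off while leaving interference suppression intact.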
