Paper Title

Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement

Paper Authors

Aswin Sivaraman, Minje Kim

Paper Abstract

This work explores how self-supervised learning can be universally used to discover speaker-specific features towards enabling personalized speech enhancement models. We specifically address the few-shot learning scenario where access to clean recordings of a test-time speaker is limited to a few seconds, but noisy recordings of the speaker are abundant. We develop a simple contrastive learning procedure which treats the abundant noisy data as makeshift training targets through pairwise noise injection: the model is pretrained to maximize agreement between pairs of differently deformed identical utterances and to minimize agreement between pairs of similarly deformed nonidentical utterances. Our experiments compare the proposed pretraining approach with two baseline alternatives: speaker-agnostic fully-supervised pretraining, and speaker-specific self-supervised pretraining without contrastive loss terms. Of all three approaches, the proposed method using contrastive mixtures is found to be most robust to model compression (using 85% fewer parameters) and reduced clean speech (requiring only 3 seconds).
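
The pairwise noise-injection idea described in the abstract can be illustrated with a minimal sketch. The hypothetical PyTorch snippet below is not the authors' implementation: the function names, the 5 dB SNR, the cosine-similarity agreement measure, and the placeholder encoder are all illustrative assumptions.

```python
# Hypothetical sketch of contrastive-mixture pretraining via pairwise noise injection.
# All names, the SNR value, and the agreement measure are illustrative assumptions.
import torch
import torch.nn.functional as F

def inject_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a noise signal into speech at the requested signal-to-noise ratio (dB)."""
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def contrastive_mixture_loss(encoder, utt_a, utt_b, noise_1, noise_2, snr_db=5.0):
    # Deform each utterance with a noise source.
    a1 = encoder(inject_noise(utt_a, noise_1, snr_db))
    a2 = encoder(inject_noise(utt_a, noise_2, snr_db))
    b1 = encoder(inject_noise(utt_b, noise_1, snr_db))
    # Positive pair: identical utterance, differently deformed -> maximize agreement.
    agree_pos = F.cosine_similarity(a1.flatten(), a2.flatten(), dim=0)
    # Negative pair: nonidentical utterances, similarly deformed -> minimize agreement.
    agree_neg = F.cosine_similarity(a1.flatten(), b1.flatten(), dim=0)
    return (1.0 - agree_pos) + agree_neg

# Toy usage: a placeholder encoder and 1-second, 16 kHz random signals (assumptions).
encoder = torch.nn.Sequential(torch.nn.Linear(16000, 128), torch.nn.Tanh())
utt_a, utt_b = torch.randn(16000), torch.randn(16000)
noise_1, noise_2 = torch.randn(16000), torch.randn(16000)
loss = contrastive_mixture_loss(encoder, utt_a, utt_b, noise_1, noise_2)
loss.backward()  # gradients flow into the encoder during pretraining
```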
