Paper Title


Speaker Representation Learning via Contrastive Loss with Maximal Speaker Separability

Paper Authors

Zhe Li, Man-Wai Mak

Paper Abstract


A great challenge in speaker representation learning using deep models is to design learning objectives that can enhance the discrimination of unseen speakers under unseen domains. This work proposes a supervised contrastive learning objective to learn a speaker embedding space by effectively leveraging the label information in the training data. In such a space, utterance pairs spoken by the same or similar speakers will stay close, while utterance pairs spoken by different speakers will be far apart. For each training speaker, we perform random data augmentation on their utterances to form positive pairs, and utterances from different speakers form negative pairs. To maximize speaker separability in the embedding space, we incorporate the additive angular-margin loss into the contrastive learning objective. Experimental results on CN-Celeb show that this new learning objective can cause ECAPA-TDNN to find an embedding space that exhibits great speaker discrimination. The contrastive learning objective is easy to implement, and we provide PyTorch code at https://github.com/shanmon110/AAMSupCon.
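The abstract describes a supervised contrastive objective in which positives are augmented utterances of the same speaker and an additive angular margin is applied to sharpen speaker separability. Below is a minimal PyTorch sketch of that idea, combining a SupCon-style loss with a margin on positive-pair cosine similarities. The function name `aam_supcon_loss` and the way the margin is injected are illustrative assumptions, not the paper's exact formulation; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def aam_supcon_loss(embeddings, labels, temperature=0.1, margin=0.2):
    """Sketch of a supervised contrastive loss with an additive angular
    margin on positive pairs (illustrative; may differ from the paper).

    embeddings: (N, D) speaker embeddings for a batch
    labels:     (N,) integer speaker labels
    """
    z = F.normalize(embeddings, dim=1)                  # unit-norm embeddings
    cos = torch.clamp(z @ z.t(), -1 + 1e-7, 1 - 1e-7)   # pairwise cosines
    theta = torch.acos(cos)
    cos_margin = torch.cos(theta + margin)              # penalized positives

    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos_mask = same & ~eye                              # positives, no self-pairs

    # Use the margin-penalized cosine for positive pairs, plain cosine otherwise.
    logits = torch.where(pos_mask, cos_margin, cos) / temperature
    logits = logits.masked_fill(eye, float('-inf'))     # drop self-similarity

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                              # anchors with a positive
    mean_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)[valid] \
        / pos_counts[valid]
    return -mean_log_prob.mean()
```

In training, each speaker's utterances would be randomly augmented so the batch contains multiple views per speaker; the margin forces positives to be closer than the plain cosine would require, mirroring the additive angular-margin idea of AAM-Softmax.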
