Paper Title

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Paper Authors

Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukas Burget, Jan Cernocky

Paper Abstract

In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, fine-tuning strategies for adapting these pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate schedules, to stabilize the fine-tuning process and further boost performance: multi-head factorized attentive pooling is proposed to factorize the comparison of speaker representations into multiple phonetic clusters. We regularize towards the parameters of the pre-trained model, and we set different learning rates for each layer of the pre-trained model during fine-tuning. The experimental results show that our method can significantly shorten the training time to 4 hours and achieve SOTA performance: 0.59%, 0.79%, and 1.77% EER on Vox1-O, Vox1-E, and Vox1-H, respectively.
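The abstract mentions two fine-tuning stabilizers: regularizing towards the pre-trained parameters and setting a different learning rate for each layer of the pre-trained model. Below is a minimal PyTorch-style sketch of how such a setup might look in general; the names `encoder`, `backend`, `base_lr`, `decay`, the geometric layer-wise decay, and the penalty weight are illustrative assumptions, not the paper's actual configuration or hyper-parameters.

```python
# Minimal sketch (assumptions noted in comments), not the paper's implementation:
# (1) per-layer learning rates for a pre-trained transformer encoder, and
# (2) an L2 penalty pulling fine-tuned weights back towards their pre-trained values.
import torch


def build_optimizer(encoder, backend, base_lr=1e-5, decay=0.9, backend_lr=1e-4):
    """Assign a geometrically decayed learning rate to each encoder layer."""
    groups = []
    layers = list(encoder.layers)  # assumes the encoder exposes its transformer blocks as .layers
    for i, layer in enumerate(layers):
        # Later (deeper) layers get larger learning rates than earlier ones.
        lr = base_lr * (decay ** (len(layers) - 1 - i))
        groups.append({"params": layer.parameters(), "lr": lr})
    # The speaker-verification backend is trained with its own, larger learning rate.
    groups.append({"params": backend.parameters(), "lr": backend_lr})
    return torch.optim.AdamW(groups)


def l2_to_pretrained(encoder, pretrained_state, weight=1e-4):
    """Penalize the squared distance between current and pre-trained encoder weights."""
    penalty = 0.0
    for name, param in encoder.named_parameters():
        penalty = penalty + (param - pretrained_state[name]).pow(2).sum()
    return weight * penalty


# Usage sketch (encoder, backend, loss_fn, and batch are assumed to exist):
# pretrained_state = {k: v.detach().clone() for k, v in encoder.state_dict().items()}
# optimizer = build_optimizer(encoder, backend)
# loss = loss_fn(backend(encoder(batch))) + l2_to_pretrained(encoder, pretrained_state)
```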
