Paper Title

CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models

Paper Authors

Zih-Ching Chen, Yu-Shun Sung, Hung-yi Lee

Paper Abstract

Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer-based models such as HuBERT, which consist of a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have introduced adapters, small lightweight modules commonly used in Natural Language Processing (NLP), to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layers and fail to adapt the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method designed specifically for SSL speech models that applies CNN adapters at the feature extractor. With this method, we fine-tune fewer than 5% of parameters per task compared to full fine-tuning and achieve better and more stable performance. We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks. For instance, the accuracy of speaker identification (SID) improves from 87.71 to 91.56, and the accuracy of emotion recognition (ER) improves by 5%.
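To make the idea concrete, below is a minimal PyTorch sketch of the kind of CNN adapter described above: a small bottleneck 1-D convolution with a residual connection, inserted after each frozen convolutional block of the feature extractor so that only the adapter parameters are trained. The class names, bottleneck width, kernel sizes, and toy extractor are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a CNN adapter on an SSL speech model's feature extractor.
# All names, the bottleneck width, and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn


class CNNAdapter(nn.Module):
    """Lightweight bottleneck 1-D conv adapter with a residual connection."""

    def __init__(self, channels: int, bottleneck: int = 32, kernel_size: int = 3):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size, padding=kernel_size // 2),
            nn.GELU(),
            nn.Conv1d(bottleneck, channels, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); the residual keeps pre-trained features intact.
        return x + self.adapter(x)


class AdaptedExtractor(nn.Module):
    """Wraps a frozen stack of conv blocks, training only the adapters."""

    def __init__(self, conv_blocks: nn.ModuleList, channels: int = 512):
        super().__init__()
        self.conv_blocks = conv_blocks
        for p in self.conv_blocks.parameters():
            p.requires_grad = False  # freeze the pre-trained feature extractor
        self.adapters = nn.ModuleList(CNNAdapter(channels) for _ in conv_blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block, adapter in zip(self.conv_blocks, self.adapters):
            x = adapter(block(x))  # gradients flow only through the adapters
        return x


# Toy usage with a HuBERT-like (but much smaller) extractor on raw waveform.
blocks = nn.ModuleList(
    nn.Sequential(nn.Conv1d(512 if i > 0 else 1, 512, 10, stride=5), nn.GELU())
    for i in range(3)
)
extractor = AdaptedExtractor(blocks, channels=512)
features = extractor(torch.randn(2, 1, 16000))  # (batch, 1, samples)
print(features.shape)
```

Because the backbone stays frozen and each adapter is a narrow bottleneck, the trainable parameter count stays in the low single-digit percent range per task, consistent with the under-5% figure reported in the abstract.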
