Paper Title
Speech Emotion Recognition using Self-Supervised Features
Paper Authors
Paper Abstract
Self-supervised pre-trained features have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use and integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features, and back-end classification networks. The proposed monomodal, speech-only system not only achieves SOTA results, but also shows that powerful, well fine-tuned self-supervised acoustic features can reach results similar to those achieved by SOTA multimodal systems using both speech and text modalities.
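To make the Upstream + Downstream paradigm concrete, the following is a minimal sketch of such a pipeline, assuming a wav2vec 2.0 upstream from the Hugging Face transformers library, mean pooling as the frame-to-utterance aggregation, and a linear back-end classifier. The paper's actual upstream models, aggregation methods, and classification networks are not specified here; all names and choices below are illustrative assumptions.

```python
# Illustrative sketch of an Upstream + Downstream SER pipeline (not the paper's exact system).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class UpstreamDownstreamSER(nn.Module):
    def __init__(self, num_classes: int = 4, freeze_upstream: bool = True):
        super().__init__()
        # Upstream: self-supervised acoustic feature extractor producing frame-level features.
        # "facebook/wav2vec2-base" is an assumed example checkpoint.
        self.upstream = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        if freeze_upstream:
            # Toggle this flag to compare frozen features against fine-tuning the upstream.
            for p in self.upstream.parameters():
                p.requires_grad = False
        # Downstream: back-end classifier over aggregated utterance-level features.
        self.classifier = nn.Linear(self.upstream.config.hidden_size, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio.
        frame_feats = self.upstream(waveform).last_hidden_state  # (batch, frames, dim)
        # Aggregate frame-level features into one utterance-level vector
        # (mean pooling here; attention pooling is another common choice).
        utt_feats = frame_feats.mean(dim=1)
        return self.classifier(utt_feats)  # emotion class logits


# Usage example with dummy audio (two 1-second utterances).
model = UpstreamDownstreamSER(num_classes=4)
logits = model(torch.randn(2, 16000))
```

Keeping the upstream, aggregation, and classifier as separate modules is what makes the design modular: swapping in a different self-supervised feature extractor or pooling strategy only changes one component.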