Paper Title
Speech Emotion Recognition using Self-Supervised Features
Paper Authors
Paper Abstract
Self-supervised pre-trained features have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use and integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features, and back-end classification networks. The proposed monomodal, speech-only system not only achieves SOTA results, but also shows that powerful, well fine-tuned self-supervised acoustic features can reach results similar to those achieved by SOTA multimodal systems using both speech and text modalities.
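To make the Upstream + Downstream paradigm concrete, the following is a minimal sketch of such a pipeline, assuming a wav2vec 2.0 upstream from the Hugging Face transformers library, mean pooling as the frame-to-utterance aggregation, and a linear back-end classifier. The paper's actual upstream models, aggregation methods, and classification networks are not specified here; all names and choices below are illustrative assumptions.

```python
# Illustrative sketch of an Upstream + Downstream SER pipeline (not the paper's exact system).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class UpstreamDownstreamSER(nn.Module):
    def __init__(self, num_classes: int = 4, freeze_upstream: bool = True):
        super().__init__()
        # Upstream: self-supervised acoustic feature extractor producing frame-level features.
        # "facebook/wav2vec2-base" is an assumed example checkpoint.
        self.upstream = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        if freeze_upstream:
            # Toggle this flag to compare frozen features against fine-tuning the upstream.
            for p in self.upstream.parameters():
                p.requires_grad = False
        # Downstream: back-end classifier over aggregated utterance-level features.
        self.classifier = nn.Linear(self.upstream.config.hidden_size, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio.
        frame_feats = self.upstream(waveform).last_hidden_state  # (batch, frames, dim)
        # Aggregate frame-level features into one utterance-level vector
        # (mean pooling here; attention pooling is another common choice).
        utt_feats = frame_feats.mean(dim=1)
        return self.classifier(utt_feats)  # emotion class logits


# Usage example with dummy audio (two 1-second utterances).
model = UpstreamDownstreamSER(num_classes=4)
logits = model(torch.randn(2, 16000))
```

Keeping the upstream, aggregation, and classifier as separate modules is what makes the design modular: swapping in a different self-supervised feature extractor or pooling strategy only changes one component.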