Paper Title
Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction
Paper Authors
Paper Abstract
Speech emotion recognition systems suffer from high prediction latency, because deep learning models are computationally demanding, and from low generalizability, mainly because of the poor reliability of emotional measurements across multiple corpora. To address these problems, we present a speech emotion recognition system based on a reductionist approach that decomposes and analyzes syllable-level features. The Mel spectrogram of an audio stream is decomposed into syllable-level components, which are then analyzed to extract statistical features. The proposed method uses formant attention, noise-gate filtering, and rolling normalization contexts to increase feature-processing speed and tolerance to adverse acoustic conditions. A set of syllable-level formant features is extracted and fed into a single-hidden-layer neural network that makes a prediction for each syllable, as opposed to the conventional approach of using a sophisticated deep learner to make sentence-wide predictions. Syllable-level predictions help achieve real-time latency and lower the aggregated error of utterance-level cross-corpus predictions. Experiments on the IEMOCAP (IE), MSP-IMPROV (MI), and RAVDESS (RA) databases show that the method achieves real-time latency while predicting with state-of-the-art cross-corpus unweighted accuracy: 47.6% for IE to MI and 56.2% for MI to IE.
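The pipeline the abstract describes (Mel spectrogram, syllable-level decomposition, per-segment statistical features, a single-hidden-layer classifier with utterance-level aggregation) can be illustrated as follows. This is a minimal sketch, assuming librosa and scikit-learn as stand-ins for the authors' implementation; the energy-gate segmentation heuristic, the mean/std/max feature set, and the 128-unit hidden layer are illustrative assumptions, not the paper's exact configuration, and the formant-attention and rolling-normalization steps are omitted.

```python
# Hypothetical sketch of syllable-level speech emotion recognition.
# Assumes: librosa for audio features, scikit-learn for the classifier,
# integer-coded emotion labels. Not the authors' released code.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def syllable_features(wav_path, n_mels=64, energy_quantile=0.5):
    """Split an utterance into syllable-like segments and return one
    statistical feature row (mean/std/max over Mel bands) per segment."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # Crude noise-gate segmentation: frames whose energy exceeds the
    # median frame energy count as voiced; each contiguous voiced run
    # approximates one syllable-level component.
    energy = log_mel.mean(axis=0)
    voiced = energy > np.quantile(energy, energy_quantile)

    feats, start = [], None
    for t, v in enumerate(np.append(voiced, False)):
        if v and start is None:
            start = t
        elif not v and start is not None:
            seg = log_mel[:, start:t]
            feats.append(np.concatenate(
                [seg.mean(axis=1), seg.std(axis=1), seg.max(axis=1)]))
            start = None
    return np.array(feats)

# Per-syllable classification with a single hidden layer; an utterance-level
# label is then recovered by majority vote over its syllable predictions.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
# X_train: stacked syllable feature rows; y_train: the utterance's emotion
# label repeated once per syllable (illustrative training setup).
# clf.fit(X_train, y_train)
# utterance_label = np.bincount(clf.predict(syllable_features("test.wav"))).argmax()
```

Classifying each short syllable segment as it arrives, rather than waiting for the full sentence, is what makes real-time latency plausible, and averaging many per-syllable votes is the mechanism by which the abstract's utterance-level cross-corpus error is reduced.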