Abstract

In natural environments, the received speech signal is often corrupted by noise, which degrades the performance of speech emotion recognition (SER) systems. To address this, a noisy SER method based on joint constraints, comprising an enhancement constraint and an arousal-valence classification constraint (EC-AVCC), is proposed. The method introduces speech enhancement features and constraints into emotion recognition for the first time, which helps improve the performance of noisy SER, and it requires no denoising pre-processing of noisy speech, which saves time at test. Specifically, it feeds multi-domain statistical features (MDSF), composed of emotional and enhancement features, into an SER model based on a convolutional neural network (CNN) and an attention-augmented long short-term memory network (ALSTM). In addition, the model is jointly constrained by speech enhancement and arousal-valence classification to obtain robust and discriminative deep emotion features. Furthermore, in the auxiliary speech enhancement task, a joint loss function that simultaneously constrains the error of the ideal ratio mask (IRM) and the error of the corresponding MDSF is introduced to obtain more robust features. Experimental results on the CASIA Chinese emotional corpus and the EMO-DB German emotional database show that, compared with a CNN baseline, the proposed method improves SER accuracy under white noise and babble noise by 4.7%-9.9%.
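
For concreteness, the sketch below shows one plausible PyTorch realization of the architecture and joint constraint described in the abstract: a CNN front end followed by an attention-augmented bidirectional LSTM with two heads, one classifying arousal-valence emotion categories and one estimating an ideal ratio mask, trained with a loss that combines emotion cross-entropy with the IRM error and the error of the mask-reconstructed MDSF. This is a minimal sketch, not the paper's implementation: all layer sizes, the number of emotion classes, the loss weights alpha and beta, and the assumption that the mask applies multiplicatively to the MDSF are illustrative choices not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNALSTM(nn.Module):
    """Sketch of a CNN + attention-LSTM (ALSTM) backbone with two task heads:
    an enhancement head predicting an ideal ratio mask (IRM) and an
    arousal-valence classification head. All sizes are assumptions."""

    def __init__(self, n_feats=64, n_classes=4, hidden=128):
        super().__init__()
        # CNN front end over (batch, 1, time, feature) MDSF input
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool the feature axis, keep time resolution
        )
        self.lstm = nn.LSTM(32 * (n_feats // 2), hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)             # additive attention over time
        self.emo_head = nn.Linear(2 * hidden, n_classes) # arousal-valence classes
        self.enh_head = nn.Linear(2 * hidden, n_feats)   # per-frame IRM estimate

    def forward(self, x):
        # x: (batch, time, n_feats) noisy MDSF features
        h = self.conv(x.unsqueeze(1))           # (B, 32, T, F/2)
        h = h.permute(0, 2, 1, 3).flatten(2)    # (B, T, 32 * F/2)
        h, _ = self.lstm(h)                     # (B, T, 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)                # pooled utterance embedding
        mask = torch.sigmoid(self.enh_head(h))  # IRM estimate in [0, 1]
        return self.emo_head(ctx), mask


def joint_loss(emo_logits, labels, mask, irm_true, noisy_feats, clean_feats,
               alpha=0.5, beta=0.5):
    """Joint constraint: emotion cross-entropy plus two enhancement terms,
    one on the IRM error and one on the error of the MDSF recovered by
    applying the predicted mask. alpha/beta are placeholder weights."""
    l_emo = F.cross_entropy(emo_logits, labels)           # emotion classification
    l_irm = F.mse_loss(mask, irm_true)                    # IRM estimation error
    l_feat = F.mse_loss(mask * noisy_feats, clean_feats)  # masked-MDSF error
    return l_emo + alpha * l_irm + beta * l_feat


# Usage on dummy data, assuming 100-frame utterances with 64-dim MDSF
model = CNNALSTM()
x = torch.randn(8, 100, 64)        # noisy MDSF sequences
clean = torch.randn(8, 100, 64)    # clean-speech MDSF training targets
irm = torch.rand(8, 100, 64)       # ideal ratio mask targets in [0, 1]
labels = torch.randint(0, 4, (8,)) # arousal-valence class labels
logits, mask = model(x)
loss = joint_loss(logits, labels, mask, irm, x, clean)
loss.backward()
```

Under this reading, the enhancement head acts purely as a training-time constraint: at test time only the emotion logits are used, which is consistent with the abstract's claim that no separate denoising pre-processing is needed for noisy input.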
