Not All Features are Equal: Selection of Robust Features for Speech Emotion Recognition in Noisy Environments

Seong-Gyun Leem, Daniel Fulford, Carlos Busso, David Gard, Jukka-Pekka Onnela

doi:10.1109/icassp43922.2022.9747705


Open Access

https://doi.org/10.1109/icassp43922.2022.9747705


Abstract

Speech emotion recognition (SER) systems deployed in real-world applications often encounter noisy speech. While most noise compensation techniques assume that all acoustic features have an equal impact on the SER model, some acoustic features may be more sensitive to noisy conditions. This paper investigates the noise robustness of each feature in the acoustic feature set. We focus on low-level descriptors (LLDs) commonly used in SER systems. We first train SER models on clean speech, each using only a single LLD. Then, we rank each LLD by two criteria: its absolute performance on a development set contaminated with noise, and its relative performance decrease compared with the results on the clean set. Our experiments show that using all the LLDs leads to worse performance than training the system with a single robust LLD. We propose to select a group of robust features according to their performance and robustness in noisy conditions. Without using any compensation method, our feature selection methods improve the performance by 24.4% (arousal), 23.9% (dominance), and 43.2% (valence) in the 10 dB noisy condition. Moreover, even though the selection is conducted with the 10 dB condition, our selection methods also yield performance improvements in unseen noisy recording conditions.
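The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the LLD names, and the scores are hypothetical, and combining the two criteria by summing their ranks is an assumption made only for illustration, since the abstract does not say how the criteria are combined. Clean-set scores are assumed to be positive.

```python
from typing import Dict, List


def rank_llds(clean: Dict[str, float], noisy: Dict[str, float]) -> List[str]:
    """Order LLDs from most to least robust.

    `clean` and `noisy` map an LLD name to the score of its single-feature
    model on the clean and noise-contaminated development sets
    (higher is better, clean scores assumed positive).
    """
    names = sorted(clean)
    # Relative performance decrease from the clean to the noisy condition.
    drop = {n: (clean[n] - noisy[n]) / clean[n] for n in names}
    # Rank 0 is best: highest noisy score, smallest relative decrease.
    by_perf = sorted(names, key=lambda n: -noisy[n])
    by_drop = sorted(names, key=lambda n: drop[n])
    perf_rank = {n: i for i, n in enumerate(by_perf)}
    drop_rank = {n: i for i, n in enumerate(by_drop)}
    # Assumption: merge the two criteria by summing ranks (ties broken by name).
    return sorted(names, key=lambda n: (perf_rank[n] + drop_rank[n], n))


def select_top_k(ranking: List[str], k: int) -> List[str]:
    """Keep the k most robust LLDs as the selected feature group."""
    return ranking[:k]


# Hypothetical scores, invented for this example only.
clean_scores = {"f0": 0.50, "loudness": 0.60, "mfcc1": 0.40}
noisy_scores = {"f0": 0.20, "loudness": 0.54, "mfcc1": 0.30}

ranking = rank_llds(clean_scores, noisy_scores)
selected = select_top_k(ranking, 2)
print(ranking, selected)
```

In this toy setting, an LLD is preferred when its single-feature model both scores well under noise and loses little relative to the clean condition. The group size `k` is a free parameter of the sketch.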

Full Text