Abstract

Speech emotion recognition (SER) systems deployed in real-world applications often encounter noisy speech. While most noise compensation techniques assume that all acoustic features have an equal impact on the SER model, some acoustic features may be more sensitive to noisy conditions than others. This paper investigates the noise robustness of each feature in the acoustic feature set. We focus on low-level descriptors (LLDs) commonly used in SER systems. We first train SER models on clean speech, each using only a single LLD. Then, we rank the LLDs by two criteria: the absolute performance on a development set contaminated with noise, and the performance decrease relative to the results of the models trained on the clean set. Our experiment shows that using all the LLDs leads to worse performance than training the system with a single robust LLD. We propose to select a group of robust features according to their performance and robustness in noisy conditions. Without using any compensation method, our feature selection methods improve performance by 24.4% (arousal), 23.9% (dominance), and 43.2% (valence) in the 10 dB noisy condition. Moreover, even though the selection is conducted in the 10 dB condition, our selection methods also yield performance improvements in unseen noisy recording conditions.
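
The two ranking criteria described above can be illustrated with a minimal sketch. It is not the authors' implementation: the LLD names and scores are made-up placeholders, the metric is assumed to be "higher is better", the reference for the relative decrease is assumed to be each single-LLD model's score on clean data, and combining the two rankings by intersecting their top-k lists is only one plausible selection rule that the abstract does not specify.

```python
from typing import Dict, List


def relative_decrease(clean: float, noisy: float) -> float:
    """Fractional drop from the clean-condition score to the noisy one."""
    return (clean - noisy) / clean


def rank_llds(clean: Dict[str, float], noisy: Dict[str, float]) -> Dict[str, List[str]]:
    """Rank LLDs by noisy-set score (best first) and by relative decrease (smallest first)."""
    by_absolute = sorted(noisy, key=lambda name: noisy[name], reverse=True)
    by_decrease = sorted(noisy, key=lambda name: relative_decrease(clean[name], noisy[name]))
    return {"absolute": by_absolute, "relative": by_decrease}


def select_robust(clean: Dict[str, float], noisy: Dict[str, float], k: int) -> List[str]:
    """Hypothetical rule: keep LLDs that are in the top k of BOTH rankings."""
    ranks = rank_llds(clean, noisy)
    top_absolute = set(ranks["absolute"][:k])
    return [name for name in ranks["relative"][:k] if name in top_absolute]


# Placeholder scores for single-LLD models (illustrative only, not from the paper).
clean_scores = {"f0": 0.50, "mfcc_1": 0.80, "loudness": 0.55, "jitter": 0.40}
noisy_scores = {"f0": 0.45, "mfcc_1": 0.48, "loudness": 0.44, "jitter": 0.10}

rankings = rank_llds(clean_scores, noisy_scores)
selected = select_robust(clean_scores, noisy_scores, k=2)
print(rankings["absolute"])  # ordered by score on the noisy development set
print(rankings["relative"])  # ordered by smallest relative drop
print(selected)
```

In this toy data the two rankings disagree: the placeholder `mfcc_1` has the best absolute score under noise but loses a large share of its clean performance, which is why a selection rule may want to consult both criteria rather than either one alone.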
