Abstract

Automatic speech emotion recognition (SER) has gained popularity over the last decade, and numerous Challenges have emerged. While the latest Challenges have shown that deep neural networks achieve the best results, existing input features remain a bottleneck and cause severe performance degradation in realistic “in-the-wild” scenarios. In this paper, we propose two innovations to tackle this issue. First, we propose to combine the bag-of-audio-words methodology with modulation spectrum features for environmental robustness. Second, we take advantage of the inherent quality-awareness properties of the modulation spectrum and propose the use of a quality feature as an additional input to the speech emotion recognizer. Experiments are conducted with three multi-lingual speech datasets used in recent SER Challenges, degraded by different noise sources and levels, and by room reverberation. Experimental results show the proposed features i) consistently outperforming benchmark systems, ii) providing complementary information to classical features, hence improving performance with feature fusion, and iii) showing robustness against environment and language mismatch. Moreover, we show that when the proposed system is provided with quality information, further improvements are obtained. Overall, the proposed bag of modulation spectrum features is shown to be a promising candidate for “in-the-wild” SER.
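To make the proposed pipeline concrete, the following is a minimal, self-contained sketch of the bag-of-audio-words (BoAW) idea applied to modulation spectrum features: per-band modulation feature vectors are quantized against a learned codebook and summarized as a normalized histogram per utterance. All parameter values (frame length, number of modulation bins, codebook size) are illustrative assumptions, not the paper's actual configuration, and plain k-means stands in for whatever codebook learner the authors used.

```python
import numpy as np

def modulation_spectrum(signal, frame_len=256, hop=128, mod_bins=8):
    """Toy modulation spectrum: magnitude FFT over time of each acoustic
    frequency band of a short-time spectrogram. Returns one modulation
    feature vector per acoustic frequency band."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))          # (frames, freq_bins)
    mod = np.abs(np.fft.rfft(spec, axis=0))[:mod_bins]  # (mod_bins, freq_bins)
    return mod.T                                        # (freq_bins, mod_bins)

def learn_codebook(features, k=16, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) codebook over feature vectors —
    an assumed stand-in for the BoAW codebook-learning step."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = features[assign == j].mean(axis=0)
    return centers

def boaw_histogram(features, centers):
    """Quantize each feature vector to its nearest codeword and count,
    yielding a fixed-length utterance descriptor for the classifier."""
    d = np.linalg.norm(features[:, None] - centers[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()

# Toy usage on a synthetic amplitude-modulated tone (4 Hz modulation).
sr = 8000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
feats = modulation_spectrum(audio)
codebook = learn_codebook(feats, k=8)
utterance_vector = boaw_histogram(feats, codebook)  # fixed-length BoAW vector
```

In a real system the codebook would be trained over many utterances and the resulting histogram fed to the emotion classifier, optionally fused with classical features and the quality feature described in the abstract.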
