Abstract

Speech Emotion Recognition (SER) is an important task because emotion is a primary dimension of human communication and health. It has a wide range of practical applications, such as assessing the mood of callers to an emergency call center and serving as a diagnostic tool for therapists. Since most SER models in the literature are trained on clean, noiseless data with input features that are not noise-robust, they are of limited use in real-world conditions, where noise is almost always present. Although methods exist to reduce the adverse effects of noise, a systematic analysis of these methods in the context of SER is lacking. In this paper, several methods to mitigate the adverse effects of noise on CNN (Convolutional Neural Network) based SER are developed and analyzed. SER models trained on the Berlin Database of Emotional Speech were tested with clean data and with data mixed with 10 different noise types at signal-to-noise ratios of 10, 15, 20, 25, 30, and 35 dB. We show that the noise robustness of SER models can be improved by combining the magnitude spectrogram with the modified group delay spectrogram, by including synthetic noise in the training data, and by using an attention mechanism. When trained with noisy data, the models using the combined input achieved a 10% higher average accuracy than models using individual inputs without noise-augmented training. Adding the attention mechanism further improved accuracy by 5%. Finally, by training and evaluating on the RAVDESS dataset, we demonstrated that the noise-robust methods developed generalize to other datasets and emotions, achieving an average accuracy of 81% on RAVDESS under noisy conditions.
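To illustrate the combined input, the sketch below computes a magnitude spectrogram and a modified group delay spectrogram and stacks them as a two-channel CNN input. It assumes the common MODGDF formulation (group delay derived from the STFTs of x[n] and n·x[n], compressed by parameters alpha and gamma); the paper's exact parameters are not given in the abstract, the file name and sample rate are placeholders, and the median-filter smoothing here is a simplification of the usual cepstral smoothing.

```python
import numpy as np
import librosa
from scipy.signal import medfilt

def modified_group_delay(y, n_fft=512, hop=128, alpha=0.4, gamma=0.9):
    """MODGDF-style sketch of the modified group delay spectrogram.

    tau = (X_r * Y_r + X_i * Y_i) / |S|^(2*gamma), compressed as
    sign(tau) * |tau|^alpha, where X = STFT(x[n]) and Y = STFT(n * x[n]).
    |S| is a smoothed magnitude spectrum; true cepstral smoothing is
    replaced by a median filter here for brevity.
    """
    X = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    n = np.arange(len(y), dtype=y.dtype)
    Y = librosa.stft(n * y, n_fft=n_fft, hop_length=hop)

    S = medfilt(np.abs(X), kernel_size=(5, 1)) + 1e-8  # simplified smoothing
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    return np.sign(tau) * np.abs(tau) ** alpha

# Two-channel CNN input: magnitude + modified group delay spectrograms.
y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file and rate
mag = np.abs(librosa.stft(y, n_fft=512, hop_length=128))
combined = np.stack([mag, modified_group_delay(y)], axis=0)  # (2, freq, time)
```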
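The noise-augmentation step can likewise be sketched as SNR-controlled mixing: the noise signal is scaled so that the ratio of clean power to noise power matches a target SNR before being added to the clean utterance. Function and variable names here are illustrative, and the random arrays stand in for loaded audio; the abstract specifies only the noise types and the SNR levels.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to clean speech at a target SNR in dB (assumes equal sample rates)."""
    # Tile or truncate the noise to the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Choose a scale so that 10*log10(P_clean / P_noise_scaled) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: augment one utterance at the SNR levels evaluated in the paper.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for a loaded utterance
noise = rng.standard_normal(8000)   # stand-in for a loaded noise clip
noisy = [mix_at_snr(clean, noise, snr) for snr in [10, 15, 20, 25, 30, 35]]
```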
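Finally, the attention mechanism can be pictured as learned pooling over time of the CNN's feature maps. The abstract does not specify the exact formulation, so the module below is a generic temporal attention-pooling sketch: a learned relevance score per frame, softmax-normalized, weighting the frame features before summation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Generic temporal attention pooling (illustrative; not the paper's exact layer).

    Input:  (batch, time, channels) CNN features with frequency collapsed.
    Output: (batch, channels) attention-weighted summary vector.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)  # one scalar relevance score per frame

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(feats), dim=1)  # (batch, time, 1)
        return (weights * feats).sum(dim=1)                # (batch, channels)

# Usage: pool 100 frames of 64-channel CNN features into one utterance vector.
pooled = AttentionPooling(64)(torch.randn(8, 100, 64))  # -> shape (8, 64)
```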
