Abstract

Data scarcity and speech degradation due to environmental noise are two significant issues in the modelling and deployment of speech emotion recognition (SER) systems. Deep learning-based SER systems overfit during modelling because of scarce training samples. Although recent attempts to tackle both issues simultaneously using data augmentation have yielded promising results, they are not robust enough to handle speech degradation caused by real environmental noise, so there is a need to further improve the classification performance of deployed SER systems. This work proposes an SER system based on a novel robust multi-window spectrogram augmentation (RMWSaug) scheme and transfer learning to handle these issues simultaneously. First, the RMWSaug scheme uses multi-window and multi-noise conditioning of clean speech samples to create the additional speech spectrograms required for training. Then, pretrained networks are adapted for speech emotion recognition and finetuned on the generated training datasets to develop a model robust to noise-induced speech degradation, thereby improving classification performance in the wild. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database was selected as the benchmark dataset for evaluating the proposed SER system. Experimental results show that the proposed system outperformed existing methods when deployed in the wild. The proposed SER system can be deployed to predict the emotions of speakers conversing on online platforms.
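The abstract does not give implementation details of RMWSaug, but the core idea it names (combining spectrograms computed with several analysis-window lengths with copies of the clean speech mixed with noise at chosen SNRs) can be sketched as follows. This is a minimal illustration, not the authors' code; the function names (`rmwsaug`, `add_noise`, `spectrogram`), the window lengths, and the SNR values are all hypothetical choices for the example.

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix noise into a clean signal at a target SNR (dB).

    Illustrative noise conditioning, not the paper's exact procedure."""
    # Tile/trim the noise to match the clean signal length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_clean / p_scaled_noise = 10^(snr_db/10).
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def spectrogram(signal, win_len, hop):
    """Magnitude spectrogram via a Hann-windowed STFT (freq x time)."""
    window = np.hanning(win_len)
    frames = [signal[i : i + win_len] * window
              for i in range(0, len(signal) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

def rmwsaug(clean, noises, win_lens, snr_dbs, hop=160):
    """Generate one spectrogram per (window length, noise, SNR) combination,
    plus a clean spectrogram per window length."""
    specs = []
    for win in win_lens:
        specs.append(spectrogram(clean, win, hop))          # clean view
        for noise in noises:
            for snr in snr_dbs:
                noisy = add_noise(clean, noise, snr)
                specs.append(spectrogram(noisy, win, hop))  # noisy views
    return specs
```

Each call multiplies the number of training spectrograms per utterance by `len(win_lens) * (1 + len(noises) * len(snr_dbs))`, which is how the scheme eases data scarcity while exposing the model to noise-degraded speech.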
