Abstract

A speech emotion recognition (SER) model for noisy environments is proposed that uses four band-pass filtered speech waveforms as the model input instead of simplified input features such as MFCCs (Mel-frequency cepstral coefficients). The four waveforms retain the full information of the original noisy speech, whereas simplified features keep only part of it; this information reduction at the model input can degrade accuracy in noisy environments. A normalized loss function is used during training to preserve the high-frequency details of the original noisy speech waveform. A multi-decoder Wave-U-Net performs the denoising, and its output waveform is fed to an emotion classifier; with this multi-decoder arrangement, the parameter count is reduced from 4.2 M used for training to 2.8 M for inference. The Wave-U-Net consists of an encoder, a 2-layer LSTM, six decoders, and skip-nets: four decoders denoise the four band-pass filtered waveforms, one denoises a pitch-related waveform, and one generates the input waveform for the emotion classifier. The proposed model degrades far less under noise than other SER works: relative to accuracy on clean speech, its accuracy degradation is 3.8% at 0 dB SNR, while it exceeds 15% in the other works. At SNRs of 0 dB, −3 dB, and −6 dB, the degradation of this work is 3.8%, 5.2%, and 7.2%, respectively.
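The abstract's front end feeds four band-pass filtered copies of the noisy waveform to the model in place of reduced features such as MFCCs. The sketch below illustrates one way such a sub-band split could be implemented with Butterworth filters; the band edges, filter order, and sample rate are assumptions for illustration, since the abstract does not specify them.

```python
# Minimal sketch of splitting a speech waveform into four band-passed
# sub-band waveforms. Band edges below are HYPOTHETICAL; the paper's
# actual cutoffs are not given in the abstract.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def split_into_bands(waveform: np.ndarray, sr: int = 16000) -> list[np.ndarray]:
    """Return four band-pass filtered copies of `waveform`."""
    edges = [(50, 500), (500, 1000), (1000, 2000), (2000, 7000)]  # assumed Hz
    bands = []
    for low, high in edges:
        # 4th-order Butterworth band-pass in second-order sections for
        # numerical stability; zero-phase filtering keeps the sub-bands
        # time-aligned with the original waveform.
        sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
        bands.append(sosfiltfilt(sos, waveform))
    return bands

# Usage: four waveforms, each the same length as the input, that together
# approximately cover the full band of the original signal.
x = np.random.randn(16000)  # 1 s of noise standing in for speech
sub_bands = split_into_bands(x, sr=16000)
```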
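The abstract also outlines the denoiser topology: a shared encoder, a 2-layer LSTM bottleneck, and six decoders with skip connections (four for the sub-bands, one for a pitch-related waveform, one producing the classifier input). The following is a minimal PyTorch sketch of that topology only; the depth, channel widths, and kernel sizes are invented for illustration and are not taken from the paper.

```python
# HYPOTHETICAL sketch of a multi-decoder Wave-U-Net matching the abstract's
# description: shared encoder -> 2-layer LSTM -> six skip-connected decoders.
# All layer sizes are assumptions; the abstract gives none.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, depth=4, ch=24):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_ch = 1
        for d in range(depth):
            # Each block halves the time length (even lengths assumed).
            self.blocks.append(nn.Conv1d(in_ch, ch * (d + 1), 15, stride=2, padding=7))
            in_ch = ch * (d + 1)

    def forward(self, x):
        skips = []
        for blk in self.blocks:
            x = torch.relu(blk(x))
            skips.append(x)  # saved for the decoders' skip connections
        return x, skips

class Decoder(nn.Module):
    def __init__(self, depth=4, ch=24):
        super().__init__()
        self.blocks = nn.ModuleList()
        for d in reversed(range(depth)):
            in_ch = ch * (d + 1) * 2          # upsampled features + skip
            out_ch = ch * d if d > 0 else 1   # last block emits a waveform
            # kernel 16 / stride 2 / padding 7 exactly doubles the length.
            self.blocks.append(nn.ConvTranspose1d(in_ch, out_ch, 16, stride=2, padding=7))

    def forward(self, z, skips):
        n = len(self.blocks)
        for i, (blk, skip) in enumerate(zip(self.blocks, reversed(skips))):
            z = blk(torch.cat([z, skip], dim=1))
            if i < n - 1:
                z = torch.relu(z)
        return z

class MultiDecoderWaveUNet(nn.Module):
    def __init__(self, depth=4, ch=24, n_decoders=6):
        super().__init__()
        self.encoder = Encoder(depth, ch)
        bottleneck_ch = ch * depth
        self.lstm = nn.LSTM(bottleneck_ch, bottleneck_ch, num_layers=2, batch_first=True)
        # 4 sub-band decoders + 1 pitch decoder + 1 classifier-input decoder
        self.decoders = nn.ModuleList(Decoder(depth, ch) for _ in range(n_decoders))

    def forward(self, x):
        z, skips = self.encoder(x)
        z = self.lstm(z.transpose(1, 2))[0].transpose(1, 2)
        return [dec(z, skips) for dec in self.decoders]

# Usage: six output waveforms from one shared encoder pass. At inference,
# only the classifier-input decoder would be needed, which is consistent
# with the abstract's smaller inference-time parameter count.
model = MultiDecoderWaveUNet()
x = torch.randn(2, 1, 16384)  # batch of 2, mono, length divisible by 2**depth
outs = model(x)
```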
