Abstract

Speech emotion recognition (SER) enables more natural human-computer interaction and has therefore attracted extensive attention from both industry and academia. Speech emotion intensity plays an important role in describing emotion, yet, to the best of our knowledge, its effect on emotion recognition has rarely been studied in the SER literature. Previous studies have shown a relationship between speech emotion intensity and emotion category, so the two recognition tasks should benefit each other when learned jointly. We propose a multi-task learning framework that uses a self-supervised speech representation extractor based on Wav2Vec 2.0 and detects speech emotion and intensity simultaneously in downstream networks. Experimental results show that the multi-task framework outperforms state-of-the-art SER models, achieving 5% and 7% improvements in SER performance on IEMOCAP and RAVDESS, respectively, thanks to the auxiliary task of emotion intensity recognition.
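The abstract describes the framework only at a high level. Below is a minimal sketch of such a multi-task setup, assuming PyTorch with the Hugging Face transformers implementation of Wav2Vec 2.0; the pooling strategy, head sizes, class counts, and loss weighting are illustrative assumptions, not the authors' exact configuration.

    # Minimal sketch of a multi-task SER model on top of Wav2Vec 2.0.
    # Assumptions (not from the paper): mean pooling over frames, two
    # linear heads, and a fixed weighted sum of the two task losses.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class MultiTaskSER(nn.Module):
        def __init__(self, num_emotions=4, num_intensities=2,
                     backbone="facebook/wav2vec2-base", intensity_weight=0.5):
            super().__init__()
            # Shared self-supervised speech representation extractor.
            self.encoder = Wav2Vec2Model.from_pretrained(backbone)
            hidden = self.encoder.config.hidden_size
            self.emotion_head = nn.Linear(hidden, num_emotions)      # main task
            self.intensity_head = nn.Linear(hidden, num_intensities) # auxiliary task
            self.intensity_weight = intensity_weight
            self.loss_fn = nn.CrossEntropyLoss()

        def forward(self, waveform, emotion_labels=None, intensity_labels=None):
            # waveform: (batch, samples) raw 16 kHz audio.
            frames = self.encoder(waveform).last_hidden_state  # (batch, time, hidden)
            pooled = frames.mean(dim=1)                        # mean-pool over time
            emotion_logits = self.emotion_head(pooled)
            intensity_logits = self.intensity_head(pooled)
            loss = None
            if emotion_labels is not None and intensity_labels is not None:
                # Joint objective: emotion loss plus weighted intensity loss.
                loss = (self.loss_fn(emotion_logits, emotion_labels)
                        + self.intensity_weight
                        * self.loss_fn(intensity_logits, intensity_labels))
            return emotion_logits, intensity_logits, loss

In this sketch, sharing the Wav2Vec 2.0 encoder between the two heads is what allows the intensity labels to act as an auxiliary training signal for emotion classification.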
