Abstract

Speech emotion recognition (SER) enables more natural human-computer interaction and has therefore attracted extensive attention from both industry and academia. Speech emotion intensity plays an important role in describing emotion, yet, to the best of our knowledge, its effect on emotion recognition has rarely been studied in the SER literature. Previous studies have shown a relationship between speech emotion intensity and emotion category, so the two recognition tasks should benefit each other when learned jointly. We propose a multi-task learning framework that uses a self-supervised speech representation extractor based on Wav2Vec 2.0 and detects speech emotion and intensity simultaneously in downstream networks. Experimental results show that the multi-task framework outperforms state-of-the-art SER models, achieving 5% and 7% improvements in SER performance on IEMOCAP and RAVDESS, respectively, thanks to the auxiliary task of emotion intensity recognition.
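The abstract describes the framework only at a high level. Below is a minimal sketch of such a multi-task setup, assuming PyTorch with the Hugging Face transformers implementation of Wav2Vec 2.0; the pooling strategy, head sizes, class counts, and loss weighting are illustrative assumptions, not the authors' exact configuration.

    # Minimal sketch of a multi-task SER model on top of Wav2Vec 2.0.
    # Assumptions (not from the paper): mean pooling over frames, two
    # linear heads, and a fixed weighted sum of the two task losses.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class MultiTaskSER(nn.Module):
        def __init__(self, num_emotions=4, num_intensities=2,
                     backbone="facebook/wav2vec2-base", intensity_weight=0.5):
            super().__init__()
            # Shared self-supervised speech representation extractor.
            self.encoder = Wav2Vec2Model.from_pretrained(backbone)
            hidden = self.encoder.config.hidden_size
            self.emotion_head = nn.Linear(hidden, num_emotions)      # main task
            self.intensity_head = nn.Linear(hidden, num_intensities) # auxiliary task
            self.intensity_weight = intensity_weight
            self.loss_fn = nn.CrossEntropyLoss()

        def forward(self, waveform, emotion_labels=None, intensity_labels=None):
            # waveform: (batch, samples) raw 16 kHz audio.
            frames = self.encoder(waveform).last_hidden_state  # (batch, time, hidden)
            pooled = frames.mean(dim=1)                        # mean-pool over time
            emotion_logits = self.emotion_head(pooled)
            intensity_logits = self.intensity_head(pooled)
            loss = None
            if emotion_labels is not None and intensity_labels is not None:
                # Joint objective: emotion loss plus weighted intensity loss.
                loss = (self.loss_fn(emotion_logits, emotion_labels)
                        + self.intensity_weight
                        * self.loss_fn(intensity_logits, intensity_labels))
            return emotion_logits, intensity_logits, loss

In this sketch, sharing the Wav2Vec 2.0 encoder between the two heads is what allows the intensity labels to act as an auxiliary training signal for emotion classification.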
