In the conventional approach to speech emotion recognition (SER), the classifier is usually trained on acted emotional speech data to predict individual basic emotions. In this work, we extend SER systems with the more realistic assumption that multiple basic emotions can coexist in an utterance. We develop the SER system on the MSP-Podcast database, which contains spontaneous speech utterances. From the primary and secondary emotion annotations of this database, we organize the labels into six prototypical and seven non-prototypical emotions. We then propose the mixed-emotion model, which expresses each non-prototypical emotion as a linear combination of prototypical emotions. The combination weights of the mixed-emotion model are computed by an algorithm based on a normalized dominance scale, inspired by the integration of basic emotion theory and dimensional emotion theory in human psychology. We first train a prototypical SER model using the ECAPA-TDNN architecture. The softmax predictions from this model serve as emotion profile inputs to the mixed-emotion model, which then predicts the non-prototypical emotions. Because the mixed-emotion model assumes the coexistence of multiple emotions, we apply it only to utterances with uniform emotion profiles. If the emotion profile instead tends toward a delta function, which indicates the probable occurrence of a single prototypical emotion, the system falls back on the conventional SER model. We develop the proposed mixed-emotion SER framework using both MFCC features and features extracted with wav2vec 2.0. Further, we show that, owing to variations in human perception, the ground-truth annotations of the non-prototypical emotions vary prominently. To address this, we extend the supervised evaluation protocol into four formulations that capture the subjective variability at different levels.
The proposed system shows a best-case performance improvement of 7.10% and 8.39% over the conventional prototypical SER model for the MFCC and wav2vec 2.0 features, respectively.
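The routing between the conventional model and the mixed-emotion model can be illustrated with a minimal sketch. It is not the authors' implementation: the normalized-entropy test for how uniform a profile is, the threshold of 0.5, the label names, and the combination weights are all assumptions made for illustration. In the actual system the weights come from the normalized dominance scale algorithm, which is not reproduced here.

```python
import math

def normalized_entropy(profile):
    """Shannon entropy of a probability vector, scaled to the range [0, 1]."""
    h = -sum(p * math.log(p) for p in profile if p > 0)
    return h / math.log(len(profile))

def predict(profile, proto_labels, mixed_weights, threshold=0.5):
    """Route a softmax emotion profile to one of the two branches.

    The entropy test and the threshold value are assumptions; the abstract
    only states that uniform profiles go to the mixed-emotion model and
    delta-like profiles stay with the conventional model.
    """
    if normalized_entropy(profile) < threshold:
        # Peaked (delta-like) profile: keep the prototypical prediction.
        best = max(range(len(profile)), key=profile.__getitem__)
        return proto_labels[best]
    # Flatter profile: score each non-prototypical emotion as a linear
    # combination of the prototypical probabilities.
    scores = {
        name: sum(w * p for w, p in zip(weights, profile))
        for name, weights in mixed_weights.items()
    }
    return max(scores, key=scores.get)

# Hypothetical placeholder labels and weights (not taken from the paper).
proto_labels = ["p0", "p1", "p2", "p3", "p4", "p5"]
mixed_weights = {
    "mix_a": [0.6, 0.4, 0.0, 0.0, 0.0, 0.0],
    "mix_b": [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
}

peaked = [0.95, 0.01, 0.01, 0.01, 0.01, 0.01]
flat = [0.10, 0.10, 0.35, 0.35, 0.05, 0.05]

print(predict(peaked, proto_labels, mixed_weights))
print(predict(flat, proto_labels, mixed_weights))
```

With these placeholder values, the peaked profile stays on the prototypical branch, while the flatter profile is scored against the two hypothetical mixtures.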