Speech emotion recognition has numerous applications, such as human-computer interaction and health management. Current methods are challenged by fuzzy decision boundaries and by the imbalance between difficult and easy samples in the training data. This paper first proposes an additive angle penalty focus loss function (APFL), which refines the fuzzy decision boundary by introducing angle penalty factors that improve intra-class compactness and enlarge inter-class distance. It also assigns larger losses to difficult samples, which are easily misclassified, so that the model pays more attention to them. To compensate for the scarcity of training samples, a multimodal, multitask learning framework with APFL is further proposed: it extracts spectrogram features with a deep neural network, text features with a pretrained language model, and audio features with a pretrained sound model, and uses gender recognition as an auxiliary task. Experimental results verify the effectiveness of the proposed loss function and framework.
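The two ingredients named in the abstract, an additive angular penalty on the target-class angle and a focal term that up-weights hard samples, can be sketched as follows. This is a minimal illustration assuming an ArcFace-style margin and a standard focal modulation, not the paper's exact APFL formulation; the function name, the margin/scale/gamma values, and the NumPy implementation are all illustrative choices.

```python
import numpy as np

def apfl_loss(features, weights, labels, margin=0.35, scale=30.0, gamma=2.0):
    """Illustrative sketch of an additive-angle-penalty focal loss.

    A margin is added to each sample's true-class angle, which tightens
    classes and widens the gaps between them; a focal factor
    (1 - p_t)^gamma then assigns larger loss to hard samples.
    features: (N, D) embeddings; weights: (C, D) class weight vectors.
    """
    # L2-normalize embeddings and class weights so dot products are cosines
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = f @ w.T                                   # (N, C) cos(theta)
    theta = np.arccos(np.clip(cos, -1 + 1e-7, 1 - 1e-7))
    # add the angular penalty only at each sample's ground-truth class
    idx = np.arange(len(labels))
    theta[idx, labels] += margin
    logits = scale * np.cos(theta)
    # softmax probabilities (numerically stabilized)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    p_t = p[idx, labels]
    # focal modulation: low-confidence (hard) samples get larger loss
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t + 1e-12)))

# Toy usage: 8 samples, 16-dim embeddings, 4 emotion classes
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
W = rng.normal(size=(4, 16))
labels = rng.integers(0, 4, size=8)
print(apfl_loss(feats, W, labels))
```

With `margin=0` and `gamma=0` the sketch reduces to a scaled softmax cross-entropy; the margin makes the true-class logit harder to satisfy, and gamma controls how strongly misclassified samples dominate the average.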