Abstract
Traditional speech emotion recognition (SER) models are typically trained with a supervised cross-entropy objective. However, given the limited size of SER datasets, cross-entropy loss alone does not adequately capture the structure within the data. Conventional contrastive learning techniques improve performance but still do not fully exploit the relationships among samples. To address these limitations, we propose an emotion recognition network based on focused contrastive learning. Our model employs bidirectional gated recurrent unit (GRU) networks to capture critical contextual dependencies. An attention mechanism fuses dialogue-level context with utterance-level features, forming a multi-stage feature fusion framework that extracts richer emotional features from the limited data. In addition, incorporating the focal loss function into training partially mitigates class imbalance. We further refine the contrastive loss with a weighted scheme over intra-dialogue categories, assigning higher penalty weights to harder-to-identify categories so that the model concentrates on subtler inter-class differences. This refinement substantially improves the model's ability to learn the distributions of different emotional features and thereby its emotion recognition accuracy. Experiments on the IEMOCAP and MELD benchmark datasets demonstrate the superiority of our model.
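The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the two loss components it describes: a focal loss for class imbalance and a supervised contrastive loss with per-class penalty weights for hard-to-recognize emotions. All class names (`FocalLoss`, `WeightedSupConLoss`), hyperparameters (`gamma`, `temperature`), and the example weight values are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of the abstract's loss design; names and values are assumptions.
import torch
import torch.nn.functional as F


class FocalLoss(torch.nn.Module):
    """Focal loss: down-weights well-classified examples so training
    concentrates on hard (often minority-class) samples."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, labels):
        ce = F.cross_entropy(logits, labels, reduction="none")  # -log p_t per sample
        p_t = torch.exp(-ce)                                    # prob. of the true class
        return ((1.0 - p_t) ** self.gamma * ce).mean()


class WeightedSupConLoss(torch.nn.Module):
    """Supervised contrastive loss with per-class penalty weights:
    anchors from harder-to-identify emotion classes contribute more."""

    def __init__(self, class_weights: torch.Tensor, temperature: float = 0.1):
        super().__init__()
        self.register_buffer("class_weights", class_weights)  # shape (num_classes,)
        self.temperature = temperature

    def forward(self, features, labels):
        # features: (N, D) utterance embeddings; labels: (N,) emotion ids
        feats = F.normalize(features, dim=1)
        sim = feats @ feats.T / self.temperature
        not_self = ~torch.eye(len(labels), dtype=torch.bool, device=sim.device)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
        sim = sim.masked_fill(~not_self, float("-inf"))        # exclude self-pairs
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_count = pos_mask.sum(dim=1).clamp(min=1)
        per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
        # Harder classes get larger weights, sharpening subtle class boundaries.
        return (self.class_weights[labels] * per_anchor).mean()


# Hypothetical usage: combine both terms with a trade-off coefficient.
logits = torch.randn(8, 4)                    # batch of 8, 4 emotion classes
feats = torch.randn(8, 128)                   # utterance embeddings
labels = torch.randint(0, 4, (8,))
weights = torch.tensor([1.0, 2.0, 1.5, 1.0])  # e.g., larger for confusable classes
loss = FocalLoss()(logits, labels) + 0.5 * WeightedSupConLoss(weights)(feats, labels)
```

Under this reading, the class weights realize the abstract's "weighted strategy for intra-dialogue category assignments": anchors belonging to frequently confused emotions incur a larger contrastive penalty, pushing their embeddings further from other classes.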