Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion

Shaode Yu, Ye Chen, Jiajian Meng, Yaoqin Xie, Wenqing Fan, Qiuirui Sun, Bing Zhu, Hang Yu

doi:10.3390/electronics13112191

Abstract

Speech emotion recognition (SER) aims to recognize human emotions through in-depth analysis of audio signals. However, it remains challenging to encode emotional cues and to fuse the encoded cues effectively. In this study, a dual-stream representation is developed, and both full training and fine-tuning of different deep networks are employed for encoding emotion patterns. A cross-attention fusion (CAF) module is then designed to integrate the dual-stream output for emotion recognition. Using different dual-stream encoders (fully training a text processing network and fine-tuning a pre-trained large language network), the CAF module is compared with three other fusion modules on three databases. SER performance is quantified with weighted accuracy (WA), unweighted accuracy (UA), and F1-score (F1S). The experimental results suggest that the CAF module outperforms the other three modules and leads to promising performance on the databases (EmoDB: WA, 97.20%; UA, 97.21%; F1S, 0.8804; IEMOCAP: WA, 69.65%; UA, 70.88%; F1S, 0.7084; RAVDESS: WA, 81.86%; UA, 82.75.21%; F1S, 0.8284). It is also found that fine-tuning a pre-trained large language network yields a better representation than fully training a text processing network. In future work, SER performance could be improved by developing a multi-stream representation of emotional cues and incorporating a multi-branch fusion mechanism for emotion recognition.
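
The abstract reports WA, UA, and F1S without defining them. The sketch below shows how these three metrics are commonly computed in SER work, assuming the usual conventions: WA as overall accuracy, UA as the mean of per-class recalls, and F1S as a macro-averaged F1. The averaging mode for F1S is an assumption, since the abstract does not state it, and the function names and toy labels are hypothetical illustrations rather than the authors' code or data.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so every emotion counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Macro F1 (assumed averaging mode): unweighted mean of per-class F1."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, class-imbalanced toy labels (not taken from the paper).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "happy", "happy", "angry"]

print("WA :", round(weighted_accuracy(y_true, y_pred), 4))
print("UA :", round(unweighted_accuracy(y_true, y_pred), 4))
print("F1S:", round(macro_f1(y_true, y_pred), 4))
```

On imbalanced data such as this toy example, WA and UA diverge because WA is dominated by the majority class, which is one reason SER studies tend to report both.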

Journal: Electronics	Publication Date: Jun 4, 2024
Citations: 1	License type: CC BY 4.0

Similar Papers

Speech Emotion Recognition Using Transfer Learning: Integration of Advanced Speaker Embeddings and Image Recognition Models
Maros Jakubec ... Peter Kasak
31 Oct 2024
Applied Sciences | VOL. 14

Graph based emotion recognition with attention pooling for variable-length utterances
Jiawang Liu ... Yao Wei
06 May 2022
Neurocomputing | VOL. 496

BAT: Block and token self-attention for speech emotion recognition
Jianjun Lei ... Ying Wang
29 Sep 2022
Neural Networks | VOL. 156

Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition
Ziping Zhao ... Björn W Schuller
23 Mar 2021
Neural Networks | VOL. 141
