Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

Yanan Shang, Tianqi Fu

doi:10.1016/j.iswa.2024.200436

Abstract

Recognizing human emotions is valuable in many real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. 39-dimensional Mel-frequency cepstral coefficients (MFCCs) were used as the speech emotion feature, and 300-dimensional word vectors obtained with the GloVe algorithm were used as the text emotion feature. A bidirectional gated recurrent unit (BiGRU), a deep learning method, was added to extract deep features. It was then combined with the multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition. The method was validated on the IEMOCAP and MELD datasets. MFCCs and GloVe word vectors showed superior recognition performance as features. Compared with the support vector machine and convolutional neural network methods, the ISSA-BiGRU-MHA method achieved the highest weighted accuracy and unweighted accuracy. Multimodal fusion achieved weighted accuracies of 76.52%, 71.84%, 66.72%, and 62.12% on the IEMOCAP, MELD, MOSI, and MOSEI datasets, respectively, suggesting better performance than unimodal recognition. These results affirm the reliability of the multimodal fusion recognition method and indicate its practical applicability.
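The abstract reports weighted accuracy and unweighted accuracy without defining them. The sketch below assumes the convention common in speech emotion recognition: weighted accuracy is the overall fraction of correctly classified samples, and unweighted accuracy is the mean of the per-class recalls, so every emotion class counts equally regardless of its size. The function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # samples per true class
    hits = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example (illustrative labels only).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 10 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0, 0.5, 0.0
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

On imbalanced data the two metrics can diverge noticeably, as in this toy case where the majority class inflates weighted accuracy, which is one reason studies often report both.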

Journal: Intelligent Systems with Applications
Publication Date: Sep 8, 2024
License type: CC BY-NC-ND

Similar Papers

Video multimodal emotion recognition based on Bi-GRU and attention fusion
Ruo-Hong Huan ... Kai-Kai Chi
Multimedia Tools and Applications | VOL. 80
31 Oct 2020

Pitch prediction from Mel-frequency cepstral coefficients using sparse spectrum recovery
M V Achuth Rao ... Prasanta Kumar Ghosh
01 Mar 2017

Drug–drug interaction extraction based on multimodal feature fusion by Transformer and BiGRU
Changqing Yu ... Guanghao Ma
Frontiers in Drug Discovery | VOL. 4
29 Oct 2024

Analysis and prediction of acoustic speech features from mel-frequency cepstral coefficients in distributed speech recognition architectures
Jonathan Darch ... Saeed Vaseghi
The Journal of the Acoustical Society of America | VOL. 124
01 Dec 2008
