Dual-TBNet: Improving the Robustness of Speech Features via Dual-Transformer-BiLSTM for Speech Emotion Recognition

Zheng Liu, Xin Kang, Fuji Ren

doi:10.1109/taslp.2023.3282092

Abstract

Speech emotion recognition has long attracted broad research attention. In traditional feature fusion methods, the speech features come only from the dataset itself, and their weak robustness can easily lead to model overfitting. In addition, these methods often fuse features by simple concatenation, which loses speech information. To address these problems and improve recognition accuracy, we use self-supervised learning to enhance the robustness of speech features and propose a feature fusion model (Dual-TBNet) that consists of two 1D convolutional layers, two Transformer modules, and two bidirectional long short-term memory (BiLSTM) modules. The model uses 1D convolution to accept features of different segment lengths and dimension sizes as input, uses the attention mechanism to capture the correspondence between the two features, and uses the bidirectional time-series module to enhance the contextual information of the fused features. We designed four fusion models in total to fuse five pre-trained features with acoustic features. In the comparison experiments, the Dual-TBNet model achieved a recognition accuracy and F1 score of 95.7% and 95.8% on the CASIA dataset, 66.7% and 65.6% on the eNTERFACE05 dataset, 64.8% and 64.9% on the IEMOCAP dataset, 84.1% and 84.3% on the EMO-DB dataset, and 83.3% and 82.1% on the SAVEE dataset. The Dual-TBNet model effectively fuses acoustic features of different lengths and dimensions with pre-trained features, enhances the robustness of the features, and achieved the best performance in these comparisons.
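The fusion idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that "capturing the correspondence between the two features" can be read as cross-attention, where one projected feature sequence queries the other, and it leaves out the remaining Transformer components and the BiLSTM modules entirely. All names (`conv1d`, `cross_attention`, `make_kernel`), the toy shapes, and the deterministic weights are hypothetical and chosen only for illustration.

```python
import math

def conv1d(frames, kernel, stride=1):
    """Valid 1D convolution over time.

    frames: list of T vectors, each of length d_in
    kernel: list of K taps, each a d_out x d_in weight matrix
    returns: list of output vectors, each of length d_out
    """
    taps = len(kernel)
    d_out = len(kernel[0])
    out = []
    for start in range(0, len(frames) - taps + 1, stride):
        vec = [0.0] * d_out
        for t in range(taps):
            x = frames[start + t]
            for o in range(d_out):
                vec[o] += sum(w * xi for w, xi in zip(kernel[t][o], x))
        out.append(vec)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention; keys and values share one sequence."""
    d = len(queries[0])
    fused, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        weights.append(w)
        fused.append([sum(w[j] * keys_values[j][c]
                          for j in range(len(keys_values)))
                      for c in range(d)])
    return fused, weights

def make_kernel(taps, d_out, d_in):
    """Hypothetical deterministic weights standing in for learned ones."""
    return [[[((t + o + i) % 3 - 1) * 0.1 for i in range(d_in)]
             for o in range(d_out)]
            for t in range(taps)]

# Toy inputs with different lengths and dimensions (assumed shapes):
# 6 acoustic frames of size 3, 4 pre-trained frames of size 5.
acoustic = [[0.1 * (t + c) for c in range(3)] for t in range(6)]
pretrained = [[0.05 * (t - c) for c in range(5)] for t in range(4)]

# Project both streams to a shared dimension of 4 so they can be compared.
acoustic_proj = conv1d(acoustic, make_kernel(2, 4, 3), stride=2)
pretrained_proj = conv1d(pretrained, make_kernel(1, 4, 5), stride=1)

# Each acoustic frame attends over all pre-trained frames.
fused, weights = cross_attention(acoustic_proj, pretrained_proj)
```

The convolution is what lets the two inputs differ in both length and dimension: after projection they share a feature size, and attention imposes no requirement that the two sequences have the same number of frames. The fused sequence keeps the length of the querying stream.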


Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date: Jan 1, 2023
Citations: 6

