Speech emotion recognition based on bi-directional acoustic–articulatory conversion

Haifeng Li, Xueying Zhang, Shufei Duan, Huizhi Liang

doi:10.1016/j.knosys.2024.112123

Abstract

Acoustic and articulatory signals are naturally coupled and complementary. Because articulatory data are difficult to acquire and acoustic–articulatory conversion is a nonlinear, ill-posed problem, previous studies on speech emotion recognition (SER) have relied primarily on unidirectional acoustic–articulatory conversion and have ignored the potential benefits of bi-directional conversion. How to address the nonlinear ill-posedness, and how to effectively extract and use the features of these two modalities in SER, remain open research questions. To bridge this gap, this study proposes Bi-A2CEmo, a framework that performs bi-directional acoustic–articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. Another challenge is the absence of a parallel acoustic–articulatory emotion database. To overcome this, the study uses electromagnetic articulography (EMA) to create STEM-E2VA, a multi-modal acoustic–articulatory emotion database for Mandarin Chinese. The proposed method is then compared with state-of-the-art models to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an SER accuracy of 89.04%, an improvement of 5.27% compared with using the actual acoustic and articulatory features recorded by EMA. On the STEM-E2VA dataset, Bi-MGAN achieves higher accuracy in mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement shows that KCLNet reduces the intra-class spacing of the features while increasing their inter-class spacing. ResTCN-FDA achieves high recognition accuracy on three publicly available datasets. Overall, the experimental results show that the proposed bi-directional acoustic–articulatory conversion framework can significantly improve SER performance.

Journal: Knowledge-Based Systems
Publication Date: Jun 13, 2024
Citations: 1

Similar Papers

Significance of Phonological Features in Speech Emotion Recognition
Wei Wang ... Lingjie Shen
International Journal of Speech Technology | VOL. 23
15 Jul 2020

EdgeRNN: A Compact Speech Recognition Network With Spatio-Temporal Features for Edge Computing
Shunzhi Yang ... Kai Ye
IEEE Access | VOL. 8
01 Jan 2020

Speech Signal Imaging and Emotion Recognition Based on Symmetric-Diagonal Matrix Model
Zijun Yang ... Aoran Xi
01 Jan 2023

Speech Emotion Recognition Based on Self-Attention Weight Correction for Acoustic and Text Features
Jennifer Santoso ... Taiichi Hashimoto
IEEE Access | VOL. 10
01 Jan 2021
