Abstract

Multimodal emotion recognition is an important research direction within artificial intelligence. In this study, we propose a model for acoustic-articulatory emotion recognition. For the acoustic data, the model extracts Interspeech 2009 (IS09) features, and for the articulatory data, it extracts our proposed phase space reconstruction-geometric (PSR-G) features. It then feeds the concatenated features into an improved sparrow search algorithm (ISSA)-cascaded deep learning (CDL) network to obtain the final recognition results. We propose the PSR-G features to capture phase and geometric information: the reconstructed phase-space trajectory of the articulatory data is plotted in three-dimensional space, and geometric features based on distances and angles are extracted from it. We also propose the ISSA-CDL network for emotion recognition, in which the CDL network effectively fuses the acoustic and articulatory features and fully leverages the advantages of a one-dimensional convolutional neural network, a multi-head self-attention mechanism, and a double-layer bidirectional long short-term memory network. Finally, we propose the ISSA, which uses a tent map and the firefly algorithm to optimize the parameters of the CDL network, reducing the instability and randomness introduced when parameters are set from subjective experience. We conducted experiments using the self-recorded STEM-E2VA database and obtained the following results: (1) PSR-G features lead to higher recognition accuracy for the articulatory data than other existing features. (2) The CDL network effectively fuses the bimodal features, and the ISSA effectively optimizes the parameters of the CDL network. (3) The final accuracy in acoustic-articulatory emotion recognition is 95.87 ± 0.29%, which is higher than that achieved with acoustic features (81.16 ± 0.47%) or articulatory features (93.27 ± 0.47%) alone.
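As an illustration of the PSR-G idea summarized above, the following Python sketch reconstructs a three-dimensional phase space from a one-dimensional articulatory trajectory via delay embedding and then derives simple distance- and angle-based statistics from the resulting point cloud. The function name `psr_geometric_features`, the delay `tau`, the embedding dimension, and the particular statistics are illustrative assumptions, not the paper's exact feature definitions.

```python
import numpy as np

def psr_geometric_features(signal, tau=5, dim=3):
    """Hypothetical sketch of PSR-G feature extraction.

    Reconstructs a 3-D phase space from a 1-D articulatory trajectory by
    delay embedding, then summarizes the trajectory with distance- and
    angle-based statistics. Parameters and statistics are assumptions for
    illustration only.
    """
    x = np.asarray(signal, dtype=float)
    n = len(x) - (dim - 1) * tau
    # Delay embedding: each row is a point (x[t], x[t+tau], x[t+2*tau]) in 3-D phase space.
    points = np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

    # Distance-based features: spread of the trajectory around its centroid.
    centroid = points.mean(axis=0)
    dists = np.linalg.norm(points - centroid, axis=1)

    # Angle-based features: turning angle between consecutive trajectory segments.
    seg = np.diff(points, axis=0)
    norms = np.linalg.norm(seg, axis=1)
    valid = (norms[:-1] > 0) & (norms[1:] > 0)
    cos_ang = np.sum(seg[:-1][valid] * seg[1:][valid], axis=1) / (
        norms[:-1][valid] * norms[1:][valid]
    )
    angles = np.arccos(np.clip(cos_ang, -1.0, 1.0))

    return np.array([dists.mean(), dists.std(), dists.max(),
                     angles.mean(), angles.std()])
```

Likewise, a minimal sketch of the cascade described for the CDL network (one-dimensional CNN, multi-head self-attention, double-layer bidirectional LSTM), assuming PyTorch; the class name `CDLSketch`, the channel, head, and hidden sizes, and the classifier head are assumptions rather than the authors' tuned architecture.

```python
import torch
import torch.nn as nn

class CDLSketch(nn.Module):
    """Illustrative cascade: 1-D CNN -> multi-head self-attention -> 2-layer BiLSTM.

    Layer sizes, head count, and the classifier head are assumptions for
    demonstration, not the paper's configuration.
    """
    def __init__(self, in_dim, n_classes, channels=64, heads=4, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.bilstm = nn.LSTM(channels, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, in_dim) fused acoustic-articulatory features.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, channels)
        h, _ = self.attn(h, h, h)                         # multi-head self-attention
        h, _ = self.bilstm(h)                             # double-layer BiLSTM
        return self.fc(h[:, -1])                          # classify from final time step
```

In this sketch, the fused acoustic-articulatory features enter as a (batch, time, feature) tensor and the logits are taken from the final time step of the BiLSTM output; the actual fusion, layer configuration, and ISSA-based parameter optimization follow the paper.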
