Abstract

Multi-modal speech emotion recognition predicts emotion categories by combining speech with other data recorded while speaking, such as video, transcribed text, body movement, or facial expression, which requires fusing features from multiple modalities. Most early studies, however, simply concatenated multi-modal features in the fusion layer after modeling each modality separately, ignoring the connections between speech and the other modalities. We therefore propose a novel multi-modal speech emotion recognition model based on multi-head attention fusion networks, which uses transcribed text and motion capture (MoCap) data covering facial expression, head rotation, and hand motion to supplement the speech signal. For the unimodal streams, we extract speech and text features with a two-layer Transformer encoder and model the MoCap data with a deep residual shrinkage network. We then modify the input of the Transformer encoder so that it learns the similarities between speech and text and between speech and MoCap, producing text and MoCap features that are better aligned with the speech features; the combined features are finally used to predict the emotion category. On the IEMOCAP dataset, our model achieves a recognition accuracy of 75.6%, outperforming earlier work.
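To make the fusion step concrete, the following is a minimal PyTorch sketch (not the authors' released implementation) of cross-modal multi-head attention in which speech features serve as the query and text or MoCap features serve as key and value; the feature dimension, head count, and four-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of speech-queried attention fusion over text and MoCap features."""
    def __init__(self, dim=256, heads=4, num_classes=4):
        super().__init__()
        # speech attends to text and to MoCap, yielding modality features
        # re-weighted toward the parts most similar to the speech signal
        self.speech_text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_mocap_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, speech, text, mocap):
        # each input: (batch, seq_len, dim) unimodal features
        text_aligned, _ = self.speech_text_attn(speech, text, text)
        mocap_aligned, _ = self.speech_mocap_attn(speech, mocap, mocap)
        # pool over time and concatenate the three streams before classification
        fused = torch.cat(
            [speech.mean(dim=1), text_aligned.mean(dim=1), mocap_aligned.mean(dim=1)],
            dim=-1,
        )
        return self.classifier(fused)

# example: batch of 2, with 100 speech frames, 120 text tokens, 80 MoCap frames
logits = CrossModalFusion()(torch.randn(2, 100, 256),
                            torch.randn(2, 120, 256),
                            torch.randn(2, 80, 256))
```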
