Abstract
Multi-modal speech emotion recognition predicts emotion categories by combining speech with other types of data, such as video, transcribed text, body movement, or facial expressions during speech, and therefore requires fusing features from multiple modalities. Most early studies, however, simply concatenated the multi-modal features in a fusion layer after modeling each modality separately, ignoring the connections between speech and the other modalities. We therefore propose a novel multi-modal speech emotion recognition model based on multi-head attention fusion networks, which uses transcribed text and motion capture (MoCap) data covering facial expression, head rotation, and hand movement to supplement speech data for emotion recognition. For unimodal modeling, we use a two-layer Transformer encoder combination model to extract speech and text features separately, and we model MoCap with a deep residual shrinkage network. In addition, we modify the input of the Transformer encoder so that it learns the similarities between speech and text and between speech and MoCap, and outputs text and MoCap features that are more similar to the speech features. Finally, the combined features are used to predict the emotion category. On the IEMOCAP dataset, our model outperforms earlier work, with a recognition accuracy of 75.6%.
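The cross-modal attention idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that speech frames act as queries and that the other modality (here, text) supplies keys and values, it omits all learned projection matrices, and every name in it (`multi_head_cross_attention`, `text_aligned`, and so on) is hypothetical. The final concatenation is likewise only one plausible reading of "combined features".

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_head(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries: list of T_q vectors; keys and values: lists of T_k vectors.
    Returns (outputs, weights); each row of weights sums to 1 over T_k.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        out = [sum(w[j] * values[j][d] for j in range(len(values)))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def multi_head_cross_attention(speech, other, num_heads):
    """Hypothetical fusion step: speech frames query another modality.

    Assumption: identity projections instead of learned ones, so each head
    simply attends over its own slice of the feature dimension.
    """
    d_model = len(speech[0])
    assert d_model % num_heads == 0 and len(other[0]) == d_model
    head_dim = d_model // num_heads
    fused = [[] for _ in speech]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = [frame[lo:hi] for frame in speech]
        kv = [token[lo:hi] for token in other]
        head_out, _ = cross_attention_head(q, kv, kv)
        for i, vec in enumerate(head_out):
            fused[i].extend(vec)  # concatenate the head outputs
    return fused


random.seed(0)
speech = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]  # 5 speech frames
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]    # 3 text tokens

# One speech-aligned text vector per speech frame.
text_aligned = multi_head_cross_attention(speech, text, num_heads=2)
# Assumed combination: concatenate speech and aligned text features per frame.
combined = [s + t for s, t in zip(speech, text_aligned)]
```

Because each output vector is a weighted average of text vectors, with weights set by similarity to a speech frame, the text representation is re-expressed on the speech time axis. A real model would add learned query, key, and value projections, residual connections, and a classifier on top of the combined features.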