The fusion of facial expressions and electroencephalogram (EEG) signals in a multimodal framework is a comprehensive and accurate approach to emotion recognition. However, current methods often concatenate the two modalities directly, overlooking the interplay between them. Furthermore, extracting facial features from a single static facial image neglects the dynamic changes in facial expressions. These limitations constrain the accuracy and stability of such models. In this study, an innovative multimodal emotion recognition network that uses continuous facial expressions and EEG signals was introduced. It incorporated a cross-modal attention fusion mechanism to establish robust correlations between the modal feature vectors, thereby generating fused vectors that carry mutual information. Additionally, a Self-Attention Convolutional Long Short-Term Memory (SA-ConvLSTM) network was employed to capture spatiotemporal information from the facial expression image sequences. The proposed model was experimentally evaluated on the DEAP and MAHNOB-HCI datasets, and its recognition accuracy exceeded that of existing state-of-the-art methods. It also performed well in a Leave-One-Subject-Out (LOSO) experiment on the DEAP dataset. The experimental results showed the effectiveness of the proposed model for multimodal emotion recognition.
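As a rough illustration of the cross-modal attention fusion described above, the following minimal PyTorch sketch lets facial and EEG feature sequences attend to each other before the attended representations are concatenated for classification. All tensor shapes, module names, and hyperparameters (feature dimension 128, 4 attention heads, binary output) are assumptions for illustration only, not the paper's actual architecture or configuration.

```python
# Minimal sketch of cross-modal attention fusion between EEG and facial
# feature sequences. All shapes, dimensionalities, and layer choices are
# illustrative assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Facial features attend to EEG features and vice versa.
        self.face_to_eeg = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.eeg_to_face = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 2),  # e.g., binary valence or arousal labels
        )

    def forward(self, face_feats: torch.Tensor, eeg_feats: torch.Tensor) -> torch.Tensor:
        # face_feats: (batch, T_face, dim) from a facial (e.g., SA-ConvLSTM-style) encoder
        # eeg_feats:  (batch, T_eeg, dim)  from an EEG feature extractor
        face_attended, _ = self.face_to_eeg(face_feats, eeg_feats, eeg_feats)
        eeg_attended, _ = self.eeg_to_face(eeg_feats, face_feats, face_feats)
        # Pool over time and concatenate the mutually attended representations.
        fused = torch.cat([face_attended.mean(dim=1), eeg_attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossModalAttentionFusion()
    face = torch.randn(8, 10, 128)   # 8 samples, 10 facial frames
    eeg = torch.randn(8, 32, 128)    # 8 samples, 32 EEG segments
    print(model(face, eeg).shape)    # torch.Size([8, 2])
```

The design intent mirrored here is that each modality's representation is conditioned on the other before fusion, rather than the two feature vectors being concatenated directly.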