Multimodal emotion recognition has emerged as a promising approach to capture the complex nature of human emotions by integrating information from various sources such as physiological signals, visual behavioral cues, and audio-visual content. However, current methods often struggle with effectively processing redundant or conflicting information across modalities and may overlook implicit inter-modal correlations. To address these challenges, this paper presents a novel multimodal emotion recognition framework which integrates audio-visual features with viewers' EEG data to enhance emotion classification accuracy. The proposed approach employs modality-specific encoders to extract spatiotemporal features, which are then aligned through contrastive learning to capture inter-modal relationships. Additionally, cross-modal attention mechanisms are incorporated for effective feature fusion across modalities. The framework, comprising pre-training, fine-tuning, and testing phases, is evaluated on multiple datasets of emotional responses. The experimental results demonstrate that the proposed multimodal approach, which combines audio-visual features with EEG data, is highly effective in recognizing emotions, highlighting its potential for advancing emotion recognition systems.
Read full abstract