<p>Understanding human emotions across diverse data sources is challenging in many applications, including healthcare, human-machine interaction, security, marketing, and gaming. Prior research has explored fusion techniques to address multimodal data heterogeneity, yet it often overlooks the importance of discriminative unimodal information and the potential complementarity among fusion strategies. Recognizing emotions from video and audio data poses difficulties such as interpreting non-verbal cues, variability in expression, contextual ambiguity, and the need for feature extraction capable of capturing subtle emotional cues accurately. Addressing these issues requires efficient emotion representation and multimodal fusion techniques, which are central to multimodal recognition research. This study introduces a novel approach, an optimized multi-layer self-attention network for emotion recognition (OMSN-ER), focused on feature-level data fusion. OMSN-ER assesses emotional states by merging facial and voice data, using a multi-layer progressive dense residual fusion network and a self-attention mountain gazelle convolutional neural network. Implemented in Python and evaluated on the RAVDESS dataset, the methodology achieves an accuracy of 0.9908, surpassing benchmark methods and demonstrating its efficacy in multimodal emotion recognition. This research represents a promising advancement in the intricate field of emotion recognition.</p>
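<p>To illustrate the general idea of feature-level fusion with self-attention described above, the following is a minimal sketch in Python (PyTorch). The module names, feature dimensions, projection layers, and classifier head are assumptions chosen for illustration; they do not reproduce the actual OMSN-ER architecture, the progressive dense residual fusion network, or the mountain gazelle optimization from the paper.</p>
<pre><code>
# Illustrative sketch only: generic feature-level fusion of facial and voice
# embeddings followed by multi-head self-attention. Dimensions and layer
# choices are assumptions, not the paper's OMSN-ER architecture.
import torch
import torch.nn as nn

class FeatureLevelFusionClassifier(nn.Module):
    def __init__(self, face_dim=512, voice_dim=128, fused_dim=256, num_emotions=8):
        super().__init__()
        # Project each modality into a shared embedding space (feature-level fusion).
        self.face_proj = nn.Linear(face_dim, fused_dim)
        self.voice_proj = nn.Linear(voice_dim, fused_dim)
        # Self-attention over the two modality tokens lets each modality
        # attend to the other before classification.
        self.self_attn = nn.MultiheadAttention(embed_dim=fused_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(fused_dim, num_emotions)

    def forward(self, face_feat, voice_feat):
        # face_feat: (batch, face_dim), voice_feat: (batch, voice_dim)
        tokens = torch.stack(
            [self.face_proj(face_feat), self.voice_proj(voice_feat)], dim=1
        )  # (batch, 2, fused_dim)
        attended, _ = self.self_attn(tokens, tokens, tokens)
        fused = attended.mean(dim=1)   # pool the two modality tokens
        return self.classifier(fused)  # emotion logits

# Example usage with random tensors standing in for extracted face/voice features.
model = FeatureLevelFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 8]) -- 8 emotion classes, as in RAVDESS
</code></pre>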