Emotion recognition has attracted considerable interest in recent years owing to its diverse and far-reaching applications, spanning from enhancing human-computer interaction to assessing mental health and improving entertainment systems. This study presents a novel approach to emotion recognition that fuses audio and video modalities using low-rank fusion techniques. The methodology leverages the complementary nature of audio and video data in capturing emotional cues: audio encapsulates tone, speech patterns, and vocal nuances, while video captures facial expressions, body language, and gestures. The challenge lies in effectively integrating these two modalities to improve recognition accuracy. To address it, the proposed method employs low-rank fusion, a dimensionality reduction technique that extracts the most informative features from both modalities while minimizing redundancy. The study also describes an implementation of the chosen low-rank fusion algorithm in a real-world emotion recognition system. The results contribute to advancing the field of emotion recognition by providing a practical and efficient solution for combining audio and video data to achieve more robust and accurate emotion classification.

Keywords: Deep Learning; Emotion Recognition; Human-Computer Interaction; Low-Rank Fusion; Multimodal Fusion.
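The abstract does not specify the exact fusion algorithm, so the following is only an illustrative sketch of one common reading of "low-rank fusion": low-rank multimodal fusion in the style of Liu et al. (2018), where the bilinear weight tensor over the audio and video embeddings is factorized into a small number of rank-specific factor matrices instead of being formed explicitly. All names and dimensions here (`LowRankFusion`, `audio_dim`, `video_dim`, `out_dim`, `rank`, the seven emotion classes in the demo) are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class LowRankFusion(nn.Module):
    """Minimal low-rank bilinear fusion of two modality embeddings.

    Rather than forming the full (audio_dim+1) x (video_dim+1) outer
    product and projecting it with a dense weight tensor, each modality
    is mapped through `rank` factor matrices; the factors are combined
    by element-wise product and summed over the rank dimension.
    """

    def __init__(self, audio_dim: int, video_dim: int, out_dim: int, rank: int = 4):
        super().__init__()
        # One factor matrix per rank per modality; the +1 appends a
        # constant term so unimodal information survives the product.
        self.audio_factors = nn.Parameter(torch.randn(rank, audio_dim + 1, out_dim) * 0.1)
        self.video_factors = nn.Parameter(torch.randn(rank, video_dim + 1, out_dim) * 0.1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        ones = audio.new_ones(audio.size(0), 1)
        a = torch.cat([audio, ones], dim=1)  # (batch, audio_dim + 1)
        v = torch.cat([video, ones], dim=1)  # (batch, video_dim + 1)
        # Project each modality through its factors: (rank, batch, out_dim).
        a_proj = torch.einsum('bi,rio->rbo', a, self.audio_factors)
        v_proj = torch.einsum('bi,rio->rbo', v, self.video_factors)
        # Element-wise product fuses the modalities; summing over the
        # rank axis reconstructs the low-rank bilinear interaction.
        return (a_proj * v_proj).sum(dim=0)  # (batch, out_dim)


if __name__ == "__main__":
    # Hypothetical embedding sizes for pre-extracted audio/video features.
    fusion = LowRankFusion(audio_dim=128, video_dim=256, out_dim=64, rank=4)
    fused = fusion(torch.randn(8, 128), torch.randn(8, 256))  # (8, 64)
    logits = nn.Linear(64, 7)(fused)  # e.g., seven emotion classes
    print(logits.shape)  # torch.Size([8, 7])
```

The low-rank factorization keeps the parameter count linear in `rank` rather than quadratic in the joint embedding size, which is consistent with the abstract's claim of extracting informative cross-modal features while minimizing redundancy.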