Abstract
Emotion recognition plays an important role in human–computer interaction. Recent studies have focused on video emotion recognition in the wild, where occlusion, illumination changes, complex behavior over time, and auditory cues all pose difficulties. State-of-the-art methods combine multiple modalities, such as frame-level, spatiotemporal, and audio approaches. However, such methods have difficulty exploiting long-term dependencies in temporal information, capturing contextual information, and integrating multi-modal information. In this paper, we introduce a flexible multi-modal system for video-based emotion recognition in the wild. Our system tracks and votes on significant faces corresponding to persons of interest in a video to classify seven basic emotions. The key contribution of this study is the use of face feature extraction with context-aware and statistical information for emotion recognition. We also build two model architectures to exploit long-term dependencies in temporal information: a temporal-pyramid model and a spatiotemporal model with a “Conv2D+LSTM+3DCNN+Classify” architecture. Finally, we propose a best selection ensemble to improve the accuracy of multi-modal fusion; it selects the best combination of spatiotemporal and temporal-pyramid models for classifying the seven basic emotions. In our experiments, we benchmark the system on the AFEW dataset and achieve high accuracy.
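The best selection ensemble described above can be illustrated with a minimal sketch. The abstract does not specify the fusion rule or the search procedure, so this example assumes late fusion by averaging class probabilities and an exhaustive search over model subsets scored on a validation set. The function name `best_selection_ensemble` and the model names are hypothetical.

```python
from itertools import combinations

import numpy as np


def best_selection_ensemble(val_probs, val_labels):
    """Return (best_subset, best_accuracy) over all non-empty model subsets.

    val_probs:  dict mapping a model name to an array of shape
                (n_samples, n_classes) holding class probabilities.
    val_labels: array of shape (n_samples,) holding integer class labels.
    """
    names = sorted(val_probs)
    labels = np.asarray(val_labels)
    best_subset, best_acc = None, -1.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            # Assumption: fuse models by averaging their class probabilities.
            fused = np.mean([np.asarray(val_probs[n]) for n in subset], axis=0)
            acc = float(np.mean(fused.argmax(axis=1) == labels))
            # Strict ">" keeps the smallest, earliest subset when scores tie.
            if acc > best_acc:
                best_subset, best_acc = subset, acc
    return best_subset, best_acc


# Toy validation data with two classes (the paper uses seven emotions).
toy_labels = np.array([0, 1, 0, 1])
toy_probs = {
    "spatiotemporal": np.array([[0.9, 0.1], [0.6, 0.4], [0.9, 0.1], [0.1, 0.9]]),
    "temporal_pyramid": np.array([[0.4, 0.6], [0.1, 0.9], [0.6, 0.4], [0.2, 0.8]]),
    "audio": np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]),
}
chosen, accuracy = best_selection_ensemble(toy_probs, toy_labels)
```

In this toy case each visual model misclassifies a different sample, so their average is more accurate than either alone. Exhaustive search is only practical for a small number of candidate models, since the number of subsets grows exponentially.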
Highlights
Emotional cues provide universal signals that enable human beings to communicate during the course of daily activities and are a significant component of social interactions. For example, people use facial expressions such as a big smile to signal their happiness to others when they feel joyful.
We propose an overall system that uses face tracking and voting to select the main face, and two models based on spatiotemporal and temporal-pyramid architectures to improve emotion recognition.
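The face tracking and voting step can be sketched as follows. The text does not state the voting rule, so this example assumes that each frame casts one vote for its largest tracked face and that the track with the most votes is treated as the main face. The function name `vote_main_face` and the `(track_id, box_area)` input format are hypothetical.

```python
from collections import Counter


def vote_main_face(frames):
    """Pick the main face track in a video by per-frame voting.

    frames: list with one entry per frame; each entry is a list of
            (track_id, box_area) tuples produced by a face tracker.
    Returns the track id with the most votes, or None if no face was seen.
    """
    votes = Counter()
    for detections in frames:
        if not detections:
            continue  # no face detected in this frame
        # Assumption: a frame votes for its largest tracked face.
        track_id, _ = max(detections, key=lambda d: d[1])
        votes[track_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Track 1 is the largest face in two frames, track 2 in one frame.
toy_frames = [[(1, 100), (2, 50)], [(1, 90)], [], [(2, 120), (1, 80)]]
main_track = vote_main_face(toy_frames)
```

Only the frames belonging to the selected track would then be passed on to the emotion recognition models.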
Summary
Emotional cues provide universal signals that enable human beings to communicate during the course of daily activities and are a significant component of social interactions. For example, people use facial expressions such as a big smile to signal their happiness to others when they feel joyful. People receive emotional cues (facial expressions, body gestures, tone of voice, etc.) from their social partners and combine them with their own experiences to perceive emotions and make suitable decisions. Research on emotion recognition attempts to develop methods based on new technologies in the computer vision and pattern recognition fields. This type of research has a wide range of applications, such as advertising, health monitoring, smart video surveillance, and the development of intelligent robotic interfaces [1].