Abstract

Human emotions can be recognized from facial expressions captured in videos. This is a growing research area in which many methods have attempted to improve video emotion detection in both lab-controlled and unconstrained environments. While existing methods achieve decent recognition accuracy on lab-controlled datasets, they deliver much lower accuracy in real-world uncontrolled environments, where a variety of challenges must be addressed, such as variations in illumination, head pose, and individual appearance. Moreover, automatically identifying the key frames that contain the expression in real-world videos is another challenge. In this article, to overcome these challenges, we propose a video emotion recognition method based on multiple feature fusion. First, uniform local binary pattern (LBP) and scale-invariant feature transform (SIFT) features are extracted from each frame in the video sequences. By applying a random forest classifier, each static frame is then labelled with its corresponding emotion class; in this way, the key frames, covering neutral and other expressions, can be identified automatically. Furthermore, a new geometric feature vector and LBP features from three orthogonal planes (LBP-TOP) are extracted from the key frames. To further improve robustness, audio features are extracted from the video sequences as an additional modality to augment the visual facial expression analysis. The audio and visual features are fused through a kernel multimodal sparse representation. Finally, emotion labels are assigned to the video sequences, with a multimodal quality measure determining the quality of each modality and its role in the decision. Results on both the Acted Facial Expressions in the Wild (AFEW) and MMI datasets demonstrate that the proposed method outperforms several counterpart video emotion recognition methods.
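As a rough illustration of the frame-labelling stage described above, the sketch below extracts uniform LBP histograms from face crops and trains a random forest to assign an emotion class to every frame, keeping non-neutral frames as key-frame candidates. It assumes pre-cropped grayscale faces and uses scikit-image and scikit-learn as stand-ins for the paper's pipeline; the SIFT branch, the geometric and LBP-TOP descriptors, and the kernel sparse-representation fusion are omitted. All data, class indices, and helper names here are hypothetical.

```python
# Minimal sketch of the per-frame labelling stage, assuming grayscale face
# crops have already been extracted from each video frame.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.ensemble import RandomForestClassifier

def uniform_lbp_histogram(face, n_points=8, radius=1):
    """Pool uniform LBP codes into a normalized histogram descriptor."""
    codes = local_binary_pattern(face, n_points, radius, method="uniform")
    n_bins = n_points + 2  # uniform patterns plus one "non-uniform" bin
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# Hypothetical training data: face crops with per-frame emotion labels
# (class 0 stands for "neutral", classes 1-6 for the basic emotions).
rng = np.random.default_rng(0)
train_faces = rng.integers(0, 256, size=(100, 64, 64)).astype(np.uint8)
train_labels = rng.integers(0, 7, size=100)

X = np.stack([uniform_lbp_histogram(f) for f in train_faces])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, train_labels)

# Label every frame of a new sequence; frames not predicted as neutral
# are retained as candidate key frames for the later feature extraction.
video_frames = rng.integers(0, 256, size=(30, 64, 64)).astype(np.uint8)
frame_labels = clf.predict(np.stack([uniform_lbp_histogram(f) for f in video_frames]))
key_frames = [f for f, y in zip(video_frames, frame_labels) if y != 0]
```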
