Abstract
The context, such as scenes and objects, plays an important role in video emotion recognition. The emotion recognition accuracy can be further improved when the context information is incorporated. Although previous research has considered the context information, the emotional clues contained in different images may be different, which is often ignored. To address the problem of emotion difference between different modes and different images, this paper proposes a hierarchical attention-based multimodal fusion network for video emotion recognition, which consists of a multimodal feature extraction module and a multimodal feature fusion module. The multimodal feature extraction module has three subnetworks used to extract features of facial, scene, and global images. Each subnetwork consists of two branches, where the first branch extracts the features of different modes, and the other branch generates the emotion score for each image. Features and emotion scores of all images in a modal are aggregated to generate the emotion feature of the modal. The other module takes multimodal features as input and generates the emotion score for each modal. Finally, features and emotion scores of multiple modes are aggregated, and the final emotion representation of the video will be produced. Experimental results show that our proposed method is effective on the emotion recognition dataset.
Highlights
Emotion recognition is an important content of a comprehensive understanding of video scenes
With the success of deep convolution neural networks (CNNs) in the field of image classification and object detection, researchers attempt to extract face features based on deep neural networks to further improve the performance of emotion recognition [3, 4]
We first build a dataset for human emotion recognition in video, named multimodal human emotion dataset (MHED)
Summary
Emotion recognition is an important content of a comprehensive understanding of video scenes. With the success of deep convolution neural networks (CNNs) in the field of image classification and object detection, researchers attempt to extract face features based on deep neural networks to further improve the performance of emotion recognition [3, 4] It cannot model the temporal evolution of emotion expression. HAMF takes the image sequence of face, scene, and context as input and can learn a discrimination video emotion representation that can make full use of the differences of different modes and images.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.