Abstract

Multimodal emotion recognition is an emerging interdisciplinary field of research in affective computing and sentiment analysis. It aims to exploit the information carried by signals of different natures to make emotion recognition systems more accurate. This is achieved by employing a powerful multimodal fusion method. In this study, a hybrid multimodal data fusion method is proposed in which the audio and visual modalities are fused using a latent space linear map and, then, their features projected into the cross-modal space are fused with the textual modality using a Dempster-Shafer (DS) theory-based evidential fusion method. Evaluation of the proposed method on videos from the DEAP dataset shows its superiority over both decision-level and non-latent space fusion methods. Furthermore, the results reveal that employing Marginal Fisher Analysis (MFA) for feature-level audio-visual fusion yields a larger improvement than cross-modal factor analysis (CFA) and canonical correlation analysis (CCA). The implementation results also show that exploiting users' textual comments together with the audiovisual content of movies improves the performance of the system.
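Only the abstract is available here, so the following Python sketch is merely an illustration of the described pipeline under assumed inputs, not the authors' implementation: audio and visual features are projected into a shared latent space with a linear map (CCA is used below for simplicity; the paper also compares MFA and CFA), fused at the feature level, and the resulting classifier's class posteriors are combined with a text classifier's posteriors via Dempster-Shafer combination over singleton hypotheses. All feature matrices, dimensions, and classifiers are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 4                  # synthetic clips and emotion classes
X_audio  = rng.normal(size=(n_samples, 64))    # placeholder audio features (e.g. MFCC-based)
X_visual = rng.normal(size=(n_samples, 128))   # placeholder visual features (e.g. per-frame)
X_text   = rng.normal(size=(n_samples, 50))    # placeholder comment/text embeddings
y = rng.integers(0, n_classes, size=n_samples)

# 1) Latent-space linear map for feature-level audio-visual fusion (CCA here;
#    the paper also evaluates MFA and CFA as the linear map).
cca = CCA(n_components=16).fit(X_audio, X_visual)
Z_audio, Z_visual = cca.transform(X_audio, X_visual)   # projections into the cross-modal space
X_av = np.hstack([Z_audio, Z_visual])                  # fused audio-visual representation

# 2) Branch classifiers whose class posteriors act as mass functions
#    over singleton hypotheses (the emotion classes).
clf_av   = LogisticRegression(max_iter=1000).fit(X_av, y)
clf_text = LogisticRegression(max_iter=1000).fit(X_text, y)
m_av   = clf_av.predict_proba(X_av)
m_text = clf_text.predict_proba(X_text)

# 3) Decision-level Dempster-Shafer combination of the two mass functions.
def ds_combine(m1, m2):
    """Dempster's rule restricted to singleton hypotheses:
    m12(a) = m1(a) * m2(a) / (1 - K), where K is the conflict mass."""
    joint = m1 * m2                                     # agreement on each class
    conflict = 1.0 - joint.sum(axis=1, keepdims=True)   # K
    return joint / (1.0 - conflict)

y_pred = ds_combine(m_av, m_text).argmax(axis=1)
print("Fused accuracy on the synthetic data:", (y_pred == y).mean())
```

Restricting the mass functions to singleton hypotheses reduces Dempster's rule to a renormalised element-wise product of the two posterior vectors, which is why the sketch can combine ordinary classifier probabilities directly.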

Highlights

  • Emotion recognition is the process of specifying the affective state of people

  • It plays an important role in affective computing and human-computer interaction (HCI) applications [1]

  • We propose a hybrid fusion method for multimodal emotion recognition which benefits from both feature- and decision-level fusion


Summary

Introduction

Emotion recognition is the process of specifying the affective state of people. It plays an important role in affective computing and human-computer interaction (HCI) applications [1]. Different applications benefit from emotion recognition, including video games [3], military healthcare [4], tutoring systems [5], predicting customer satisfaction [6], and Twitter analysis [7]. Multimodal emotion recognition has attracted increasing attention from researchers as it can overcome the limitations of monomodal systems [8]–[10]. Multimodal emotion recognition fuses complementary information from different modalities at different fusion levels. These levels can be classified into two categories: fusion prior to matching (feature-level fusion) and fusion after matching (decision-level fusion).
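As a rough illustration of these two categories (and not of the paper's specific method), the sketch below contrasts fusion prior to matching, where modality features are concatenated before a single classifier, with fusion after matching, where per-modality classifiers are trained and their outputs are combined afterwards. The data, dimensions, and the posterior-averaging rule are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_samples, n_classes = 300, 3
X_a = rng.normal(size=(n_samples, 32))   # placeholder modality A features (e.g. audio)
X_b = rng.normal(size=(n_samples, 48))   # placeholder modality B features (e.g. visual)
y = rng.integers(0, n_classes, size=n_samples)

# Fusion prior to matching (feature level): concatenate modality features
# and train a single classifier on the joint representation.
early_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X_a, X_b]), y)
early_pred = early_clf.predict(np.hstack([X_a, X_b]))

# Fusion after matching (decision level): train one classifier per modality
# and combine their outputs afterwards (simple posterior averaging here;
# the paper instead applies DS theory-based evidential fusion at this stage).
clf_a = LogisticRegression(max_iter=1000).fit(X_a, y)
clf_b = LogisticRegression(max_iter=1000).fit(X_b, y)
late_pred = ((clf_a.predict_proba(X_a) + clf_b.predict_proba(X_b)) / 2).argmax(axis=1)
```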


