Abstract

Multimodal multi-label emotion recognition (MMER) is a vital yet challenging task in affective computing. Despite significant progress in previous works, three limitations remain: (i) limited applicability in real-world scenarios due to the assumption that multimodal data are pre-aligned; (ii) inadequate utilization of long-term dependencies across modalities; and (iii) insufficient exploitation of correlations among emotion labels. To overcome these limitations, this paper proposes a Multi-modal Attention Graph model with Dynamic Routing-by-Agreement (MAGDRA). In MAGDRA, multi-modal data can be fused without pre-alignment via a pseudo-alignment algorithm (PAA). Furthermore, an Expectation-maximized Cross-modal Temporal (ECT) fusion approach is presented to effectively learn cross-modal interactions and long-term dependencies among visual, audio, and textual data. Moreover, to address the insufficient modeling of correlations among multiple labels, a Reinforced Multi-Label Emotion Detection (RMLED) module is carefully designed. Extensive experiments on three public benchmark datasets, IEMOCAP, CMU-MOSI, and CMU-MOSEI, demonstrate that MAGDRA outperforms existing methods and has the potential to generalize to multi-modal multi-label tasks in other domains.
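For readers unfamiliar with the routing mechanism the model's name refers to, the following is a minimal NumPy sketch of dynamic routing-by-agreement as introduced for capsule networks (Sabour et al., 2017). It illustrates the general iterative agreement procedure only; the capsule shapes, iteration count, and the mapping to emotion-label capsules are illustrative assumptions, not the authors' exact MAGDRA configuration.

```python
# Illustrative sketch of dynamic routing-by-agreement (Sabour et al., 2017).
# Shapes and the 3-iteration default are assumptions for demonstration.
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity: keeps vector direction, maps norm into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Route prediction vectors u_hat of shape (num_in, num_out, dim) to output capsules."""
    num_in, num_out, dim = u_hat.shape
    b = np.zeros((num_in, num_out))                             # routing logits, start uniform
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)    # coupling coefficients (softmax over outputs)
        s = np.einsum('io,iod->od', c, u_hat)                   # weighted sum of predictions per output capsule
        v = squash(s)                                           # output capsule vectors
        b = b + np.einsum('iod,od->io', u_hat, v)               # agreement between prediction and output updates logits
    return v

# Toy usage: 8 input capsules routed to 4 output capsules (e.g., emotion labels) of dimension 16.
rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(8, 4, 16)))
print(v.shape)  # (4, 16)
```

In this routing scheme, output capsules whose vectors agree with many input predictions receive larger coupling coefficients over the iterations, which is the property that makes it attractive for capturing dependencies among multiple labels.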
