Abstract

Human multi-modal emotion analysis involves time-series data from different modalities, such as verbal, visual, and auditory signals. Because each modality is sampled at a different rate, the collected data streams are unaligned, and this cross-modal asynchrony makes multi-modal fusion more difficult. We therefore propose a new Cross-Modality Reinforcement (CMR) model, built on recent advances in cross-modality transformers, that performs multi-modal fusion over unaligned multi-modal sequences for emotion prediction. To handle the long-range dependencies of unaligned sequences, we introduce a time-domain aggregation module that models each single modality by aggregating information along the time dimension and enhancing contextual dependencies. Moreover, our approach introduces a CMR strategy: given a main and a secondary modality as inputs, the main-modality features are strengthened through cross-modality attention and a cross-modality gate, so that secondary-modality information flows into the main modality while main-modality-specific features are retained and missing cues are complemented. This process gradually learns the features that the main and secondary modalities contribute in common and reduces the noise caused by the variability of modality features. Finally, the enhanced features are used to predict human emotions. We evaluate CMR on two multi-modal sentiment analysis benchmark datasets and report accuracies of 82.7% on CMU-MOSI and 82.5% on CMU-MOSEI, respectively, demonstrating that our method outperforms current state-of-the-art methods.
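
To make the cross-modality attention and gating idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' code): the module name CrossModalityReinforcement, the feature dimension, and the gating formulation are illustrative assumptions. The main modality queries the secondary modality via cross-modal attention, and a learned gate controls how much attended secondary information is fused back, preserving main-modality-specific features.

```python
import torch
import torch.nn as nn

class CrossModalityReinforcement(nn.Module):
    """Illustrative sketch of cross-modality attention plus gating;
    hyperparameters and structure are assumptions, not the paper's exact model."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Cross-modality attention: query = main modality, key/value = secondary modality
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modality gate: controls how much secondary information flows into the main stream
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, main, secondary):
        # main: (batch, T_main, d_model); secondary: (batch, T_sec, d_model)
        # Sequence lengths may differ, so unaligned streams are handled naturally.
        attended, _ = self.cross_attn(main, secondary, secondary)
        g = self.gate(torch.cat([main, attended], dim=-1))
        # Gated residual fusion keeps main-specific cues while adding complementary ones
        return self.norm(main + g * attended)

# Hypothetical usage with unaligned sequence lengths (different T per modality)
main_feats = torch.randn(8, 50, 64)        # e.g. text features
secondary_feats = torch.randn(8, 375, 64)  # e.g. audio features at a higher sampling rate
reinforced = CrossModalityReinforcement()(main_feats, secondary_feats)
print(reinforced.shape)  # torch.Size([8, 50, 64])
```

The gated residual keeps the output aligned with the main modality's time axis, which matches the abstract's description of strengthening main-modality features rather than replacing them.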
