Abstract

Multimodal sentiment analysis and emotion recognition have become increasingly popular research areas, where the biggest challenge is to efficiently fuse input information from different modalities. Recent success is largely credited to attention-based models, e.g., the transformer and its variants. However, the attention mechanism often neglects the coherency of human emotion due to its parallel structure. Inspired by the emotional arousal model in cognitive science, this paper proposes a Deep Emotional Arousal Network (DEAN) that is capable of simulating emotional coherence by incorporating time dependence into the parallel structure of the transformer. The proposed DEAN model consists of three components: a cross-modal transformer devised to simulate the function of the human perceptual analysis system; a multimodal BiLSTM developed to imitate the cognitive comparator; and a multimodal gating block introduced to mimic the activation mechanism in the human emotional arousal model. We perform extensive comparison and ablation studies on three benchmarks for multimodal sentiment analysis and emotion recognition. The empirical results indicate that DEAN achieves state-of-the-art performance, and useful insights are derived from the results.
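The abstract does not give implementation details for the multimodal gating block. As an illustrative sketch only, a gating block of this general kind often computes a sigmoid gate that interpolates between the transformer features and the BiLSTM features; the function names, the per-dimension gating form, and the weight layout below are assumptions, not the paper's actual equations.

```python
import math

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(transformer_feat, bilstm_feat, gate_weights, bias):
    """Illustrative gating block (hypothetical, not DEAN's exact formulation).

    For each feature dimension i, a scalar gate g_i in (0, 1) is computed
    from both modality streams, and the output is the convex combination
    g_i * transformer_feat[i] + (1 - g_i) * bilstm_feat[i].
    """
    fused = []
    for i in range(len(transformer_feat)):
        # Gate pre-activation from a small learned projection of both inputs
        z = (gate_weights[i][0] * transformer_feat[i]
             + gate_weights[i][1] * bilstm_feat[i]
             + bias[i])
        g = sigmoid(z)
        fused.append(g * transformer_feat[i] + (1.0 - g) * bilstm_feat[i])
    return fused
```

With zero weights and bias the gate is 0.5 everywhere, so the output is the elementwise average of the two feature streams; in a trained model the gate would instead learn, per dimension, how much to trust the parallel transformer pathway versus the sequential BiLSTM pathway.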
