Abstract

We present a learning model for multimodal, context-aware emotion recognition. Our approach combines multiple co-occurring human modalities (such as face, speech, text, and pose/gait) with two interpretations of context. For the first interpretation of context, we use a self-attention-based CNN to gather and encode background semantic information from the input image/video. For the second interpretation, we use depth maps to model the sociodynamic interactions among people in the input image/video. We combine the modality and context channels with multiplicative fusion, which learns to focus on the more informative input channels and suppress the others for each incoming datapoint. We demonstrate the effectiveness of our model on four benchmark emotion recognition datasets (IEMOCAP, CMU-MOSEI, EMOTIC, and GroupWalk). Our model outperforms state-of-the-art (SOTA) learning methods with an average $5\%-9\%$ improvement across all the datasets. We also perform ablation studies to show the importance of multimodality, context, and multiplicative fusion.
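For concreteness, the sketch below shows one common way a multiplicative-combination objective over several modality/context channels can be written in PyTorch: each channel's cross-entropy term is down-weighted when the remaining channels already predict the true class confidently, so training focuses on the more informative channels. The function name, the exponent `beta`, and the exact weighting scheme are illustrative assumptions, not the paper's verified formulation.

```python
import torch
import torch.nn.functional as F

def multiplicative_fusion_loss(channel_logits, target, beta=2.0):
    """Sketch of a multiplicative-combination loss over M channels.

    channel_logits: list of (batch, num_classes) tensors, one per
                    modality/context channel (assumed interface).
    target:         (batch,) tensor of ground-truth class indices.
    beta:           down-weighting exponent (assumed hyperparameter).
    """
    m = len(channel_logits)
    # Probability each channel assigns to the ground-truth class.
    probs = [
        F.softmax(logits, dim=-1).gather(1, target.unsqueeze(1)).squeeze(1)
        for logits in channel_logits
    ]  # each: (batch,)
    loss = 0.0
    for i in range(m):
        # If the other channels are already confident on the true class,
        # (1 - p_j) is small and channel i's loss term is suppressed.
        others = [1.0 - probs[j] for j in range(m) if j != i]
        weight = torch.stack(others, dim=0).prod(dim=0) ** (beta / (m - 1))
        loss = loss + (-weight * torch.log(probs[i].clamp_min(1e-8))).mean()
    return loss
```

In such a scheme the suppression is computed per datapoint, so a channel that is noisy for one sample (e.g., occluded face) can still dominate for another sample where it is reliable.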
