Abstract

The human brain extracts spatial and temporal semantic information by processing multiple modalities, which is contextually meaningful for perceiving and understanding an individual's emotional state. However, modeling multimodal sequences poses two main challenges: 1) the different sampling rates of multimodal data make cross-modal interactions difficult; 2) unimodal representations must be fused efficiently while the relationships among modalities are captured effectively. In this paper, we design a weighted cross-modal attention mechanism that not only captures the temporal correlations and spatial dependencies of each modality, but also dynamically adjusts the weight of each modality across time steps. In addition, unimodal subtasks are introduced to assist the representation learning of each specific modality, and the multimodal task and unimodal subtasks are trained jointly to exploit the complementary relationships among modalities. Our model sets a new state-of-the-art record on the CMU-MOSI dataset and brings noticeable performance improvements on all metrics. On the CMU-MOSEI dataset, our model remains the highest among all compared models on the binary-classification F1 score, the 7-class task, and the regression task, and is lower only than the multimodal split attention fusion (MSAF) model with aligned data on binary-classification accuracy, demonstrating the strong performance of the proposed method.
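To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of a weighted cross-modal attention block: one modality attends over another with sequences of different lengths, and a learned per-time-step gate re-weights how much cross-modal context is admitted. The module name, dimensions, and gating scheme are illustrative assumptions.

```python
# Hypothetical sketch of weighted cross-modal attention; details are assumptions,
# not the paper's exact architecture.
import torch
import torch.nn as nn


class WeightedCrossModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Cross-modal attention: queries come from the target modality,
        # keys/values from the source modality (handles unaligned lengths).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned weight per time step for how much cross-modal context to keep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_t, dim); source: (batch, T_s, dim); T_t and T_s may differ.
        context, _ = self.attn(query=target, key=source, value=source)
        w = self.gate(torch.cat([target, context], dim=-1))  # (batch, T_t, 1)
        return self.norm(target + w * context)


if __name__ == "__main__":
    text = torch.randn(8, 50, 64)    # e.g. text features, 50 steps
    audio = torch.randn(8, 375, 64)  # e.g. audio features at a higher sampling rate
    fused = WeightedCrossModalAttention(dim=64)(text, audio)
    print(fused.shape)  # torch.Size([8, 50, 64])
```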
