Weakening the Dominant Role of Text: CMOSI Dataset and Multimodal Semantic Enhancement Network

Cong Jin, Guangzhe Zhao, Shuwu Zhang, Guixuan Zhang, Ming Yan, Cong Luo

doi:10.1109/tnnls.2023.3282953

Abstract

Multimodal sentiment analysis (MSA) is important for quickly and accurately understanding people's attitudes and opinions about an event. However, existing sentiment analysis methods suffer from the dominant contribution of the text modality in the dataset, a problem known as text dominance. In this context, we emphasize that weakening the dominant role of the text modality is important for MSA tasks. To address this problem from the perspective of datasets, we first propose the Chinese multimodal opinion-level sentiment intensity (CMOSI) dataset. Three versions of the dataset were constructed: one with manually proofread subtitles, one with subtitles generated by machine speech transcription, and one with subtitles generated by human cross-language translation. The latter two versions radically weaken the dominant role of the text modality. We randomly collected 144 real videos from the Bilibili video site and manually edited 2557 emotion-bearing clips from them. From the perspective of network modeling, we propose a multimodal semantic enhancement network (MSEN) based on a multihead attention mechanism that takes advantage of the multiple versions of the CMOSI dataset. Experiments on CMOSI show that the network performs best with the text-unweakened version of the dataset. The loss of performance is minimal on both text-weakened versions, indicating that our network can fully exploit the latent semantics in the nontext modalities. In addition, we conducted model generalization experiments with MSEN on the MOSI, MOSEI, and CH-SIMS datasets, and the results show that our approach is also very competitive and has good cross-language robustness.

Full Text