Abstract

Modality representation learning is a critical issue in multimodal sentiment analysis (MSA). A good sentiment representation should contain as much effective information as possible while being discriminative enough to be recognized reliably. Previous attention-based MSA methods rely mainly on word-level feature interactions to capture intra-modality and inter-modality relations, which may lead to the loss of essential sentiment information. Furthermore, they focus primarily on information fusion and do not give enough importance to feature discrimination. To address these challenges, we propose a modal-utterance-temporal attention network with multimodal sentiment loss (MUTA-Net) for learning discriminative multi-relation representations, where the modal-utterance-temporal attention (MUTA) and the multimodal sentiment loss (MMSL) are the two core units. First, we propose MUTA to incorporate utterance-level feature vectors into the interactions of different modalities, which helps extract more useful relationships, since utterance-level vectors may contain sentiment information complementary to word-level vectors. Second, MMSL is designed to achieve a large inter-class distance and a small intra-class distance simultaneously in multimodal scenarios, enhancing the discriminative power of the feature representations. Our experiments on four public multimodal datasets show that MUTA-Net significantly outperforms previous baselines.
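
The abstract does not spell out the exact form of MMSL, but its stated goal (small intra-class distance, large inter-class distance on fused multimodal features) can be illustrated with a minimal center-loss-style sketch. The class-center buffer, margin value, and equal weighting below are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn


class DiscriminativeSentimentLoss(nn.Module):
    """Hypothetical sketch: pull features toward their class centers (intra-class)
    while pushing distinct class centers at least `margin` apart (inter-class)."""

    def __init__(self, num_classes: int, feat_dim: int, margin: float = 1.0):
        super().__init__()
        # Learnable class centers for the fused multimodal features (assumption).
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin = margin

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Intra-class term: squared distance of each feature to its class center.
        intra = ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

        # Inter-class term: hinge penalty on center pairs closer than the margin.
        dists = torch.cdist(self.centers, self.centers)
        off_diag = ~torch.eye(len(self.centers), dtype=torch.bool, device=dists.device)
        inter = torch.clamp(self.margin - dists[off_diag], min=0.0).mean()

        return intra + inter


if __name__ == "__main__":
    # Usage: this term would be added to the usual task loss (e.g. cross-entropy).
    loss_fn = DiscriminativeSentimentLoss(num_classes=3, feat_dim=128)
    feats = torch.randn(8, 128)            # fused multimodal utterance features
    labels = torch.randint(0, 3, (8,))
    print(loss_fn(feats, labels).item())
```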
