Abstract

Multi-modal sentiment analysis extends the conventional text-based formulation of sentiment analysis to a multi-modal setup in which multiple relevant modalities are leveraged. In real applications, however, acquiring annotated multi-modal data is typically labor-intensive and time-consuming. In this paper, we aim to reduce the annotation effort for multi-modal sentiment classification via semi-supervised learning. The key idea is to leverage semi-supervised variational autoencoders to mine more information from unlabeled data for multi-modal sentiment analysis. Specifically, the mined information includes both the independent knowledge within a single modality and the interactive knowledge among different modalities. Empirical evaluation demonstrates the effectiveness of the proposed semi-supervised approach to multi-modal sentiment classification.
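
To make the key idea concrete, below is a minimal PyTorch sketch of an M2-style semi-supervised VAE for a single modality, the kind of building block the abstract refers to. The module sizes, feature dimensions, and the classification weight `alpha` are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of an M2-style semi-supervised VAE (SVAE) for ONE modality,
# operating on pre-extracted utterance features (text, audio, or vision).
# Dimensions and the weight `alpha` are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniModalSVAE(nn.Module):
    def __init__(self, x_dim=300, y_dim=2, z_dim=64, h_dim=128):
        super().__init__()
        self.y_dim = y_dim
        self.classifier = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                        nn.Linear(h_dim, y_dim))        # q(y|x)
        self.encoder = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)                           # q(z|x,y)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim + y_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))           # p(x|y,z)

    def elbo(self, x, y_onehot):
        h = self.encoder(torch.cat([x, y_onehot], dim=-1))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()            # reparameterization
        x_rec = self.decoder(torch.cat([z, y_onehot], dim=-1))
        rec = F.mse_loss(x_rec, x, reduction="none").sum(-1)            # -log p(x|y,z) (Gaussian)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)     # KL(q(z|x,y) || p(z))
        return -(rec + kl)                                              # per-example ELBO

    def labeled_loss(self, x, y, alpha=1.0):
        y_onehot = F.one_hot(y, self.y_dim).float()
        ce = F.cross_entropy(self.classifier(x), y, reduction="none")
        return (-self.elbo(x, y_onehot) + alpha * ce).mean()

    def unlabeled_loss(self, x):
        # Marginalize over the unknown label: E_{q(y|x)}[-ELBO(x, y)] - H(q(y|x))
        probs = F.softmax(self.classifier(x), dim=-1)
        loss = 0.0
        for k in range(self.y_dim):
            y_k = torch.full((x.size(0),), k, dtype=torch.long, device=x.device)
            loss = loss + probs[:, k] * (-self.elbo(x, F.one_hot(y_k, self.y_dim).float()))
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
        return (loss - entropy).mean()
```

The unlabeled term is what allows such a model to mine information from data without sentiment labels; the labeled term combines the ELBO with an ordinary classification loss.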

Highlights

  • As an increasingly popular area in affective computing [5], multi-modal sentiment analysis [2], [4] focuses on generalizing text-based sentiment analysis to a multi-modal setup, where various communicative modalities, i.e., text, vision, and audio, are present

  • We propose to perform semi-supervised learning for multi-modal sentiment analysis with proper exploitation of both the independent knowledge within a single modality and the interactive knowledge among different modalities, motivated by the following two factors

  • We propose a bi-modal SVAE approach that adds a loss term measuring the distance between the output sentiment vector representations of two independent uni-modal SVAEs (a sketch follows below)
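
As a concrete illustration of the last highlight, the following sketch couples two uni-modal SVAEs (such as the UniModalSVAE sketch after the abstract) with a distance term between their output sentiment representations. Using the classifiers' softmax outputs as the "sentiment vector" and a squared Euclidean distance with weight `beta` are assumptions made for illustration; the paper may use a different representation or metric.

```python
# Sketch of a bi-modal SVAE objective: two independent uni-modal SVAEs plus a
# distance term that encourages their sentiment representations to agree.
# The softmax-output representation, the squared-L2 distance, and the weight
# `beta` are illustrative assumptions.
import torch.nn.functional as F

def bimodal_svae_loss(svae_a, svae_b, x_a, x_b, y=None, alpha=1.0, beta=0.1):
    """x_a, x_b: features of the same utterances in two modalities;
    y: labels (None for an unlabeled batch)."""
    if y is not None:
        loss = svae_a.labeled_loss(x_a, y, alpha) + svae_b.labeled_loss(x_b, y, alpha)
    else:
        loss = svae_a.unlabeled_loss(x_a) + svae_b.unlabeled_loss(x_b)
    # Interactive knowledge: the two modalities should predict similar
    # sentiment for the same utterance.
    s_a = F.softmax(svae_a.classifier(x_a), dim=-1)
    s_b = F.softmax(svae_b.classifier(x_b), dim=-1)
    return loss + beta * (s_a - s_b).pow(2).sum(-1).mean()
```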


Summary

INTRODUCTION

As an increasingly popular area in affective computing [5], multi-modal sentiment analysis [2], [4] focuses on generalizing text-based sentiment analysis to a multi-modal setup, where various communicative modalities, i.e., text (spoken language), vision (gestures), and audio (voice), are present. It is normally hard to obtain a sufficient amount of labeled data that integrates text, vision, and audio, since manual annotation of multi-modal data is labor-intensive and time-consuming. To address this challenge, semi-supervised learning becomes crucial to the successful application of multi-modal sentiment analysis. To the best of our knowledge, we are the first to apply semi-supervised learning to utterance-level multi-modal sentiment analysis covering the text, vision, and audio modalities. We propose a multi-modal semi-supervised variational autoencoder approach to alleviate manual annotation and improve the performance of multi-modal sentiment classification with both the independent and interactive knowledge. Our approach substantially advances the state of the art on two popular multi-modal sentiment analysis datasets, i.e., CMU-MOSI and CMU-MOSEI.

RELATED WORK
AUDIO AND VIDEO-BASED SEMI-SUPERVISED SENTIMENT CLASSIFICATION
LOW-LEVEL FEATURE EXTRACTION
EXPERIMENTAL RESULTS
Findings
VIII. CONCLUSION