Abstract

Multi-modal sentiment analysis extends the conventional text-based formulation of sentiment analysis to a multi-modal setup in which multiple relevant modalities are leveraged. In real applications, however, acquiring annotated multi-modal data is typically labor-intensive and time-consuming. In this paper, we aim to reduce the annotation effort for multi-modal sentiment classification via semi-supervised learning. The key idea is to leverage semi-supervised variational autoencoders to mine more information from unlabeled data for multi-modal sentiment analysis. Specifically, the mined information includes both the independent knowledge within a single modality and the interactive knowledge among different modalities. Empirical evaluation demonstrates the effectiveness of the proposed semi-supervised approach to multi-modal sentiment classification.
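
To make the key idea concrete, below is a minimal PyTorch sketch of an M2-style semi-supervised VAE for a single modality, the kind of building block the abstract refers to. The module sizes, feature dimensions, and the classification weight `alpha` are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of an M2-style semi-supervised VAE (SVAE) for ONE modality,
# operating on pre-extracted utterance features (text, audio, or vision).
# Dimensions and the weight `alpha` are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniModalSVAE(nn.Module):
    def __init__(self, x_dim=300, y_dim=2, z_dim=64, h_dim=128):
        super().__init__()
        self.y_dim = y_dim
        self.classifier = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                        nn.Linear(h_dim, y_dim))        # q(y|x)
        self.encoder = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)                           # q(z|x,y)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim + y_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))           # p(x|y,z)

    def elbo(self, x, y_onehot):
        h = self.encoder(torch.cat([x, y_onehot], dim=-1))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()            # reparameterization
        x_rec = self.decoder(torch.cat([z, y_onehot], dim=-1))
        rec = F.mse_loss(x_rec, x, reduction="none").sum(-1)            # -log p(x|y,z) (Gaussian)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)     # KL(q(z|x,y) || p(z))
        return -(rec + kl)                                              # per-example ELBO

    def labeled_loss(self, x, y, alpha=1.0):
        y_onehot = F.one_hot(y, self.y_dim).float()
        ce = F.cross_entropy(self.classifier(x), y, reduction="none")
        return (-self.elbo(x, y_onehot) + alpha * ce).mean()

    def unlabeled_loss(self, x):
        # Marginalize over the unknown label: E_{q(y|x)}[-ELBO(x, y)] - H(q(y|x))
        probs = F.softmax(self.classifier(x), dim=-1)
        loss = 0.0
        for k in range(self.y_dim):
            y_k = torch.full((x.size(0),), k, dtype=torch.long, device=x.device)
            loss = loss + probs[:, k] * (-self.elbo(x, F.one_hot(y_k, self.y_dim).float()))
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
        return (loss - entropy).mean()
```

The unlabeled term is what allows such a model to mine information from data without sentiment labels; the labeled term combines the ELBO with an ordinary classification loss.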

Highlights

  • As an increasingly popular area in affective computing [5], multi-modal sentiment analysis [2], [4] focuses on generalizing text-based sentiment analysis to a multi-modal setup, where various communicative modalities, i.e., text, vision, and audio, are present

  • We propose to perform semi-supervised learning for multi-modal sentiment analysis with proper exploitation of both the independent knowledge within a single modality and the interactive knowledge among different modalities, motivated by the following two factors

  • We propose a bi-modal SVAE approach that adds a loss term measuring the distance between the output sentiment vector representations of two independent uni-modal SVAEs (a sketch follows below)
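
As a concrete illustration of the last highlight, the following sketch couples two uni-modal SVAEs (such as the UniModalSVAE sketch after the abstract) with a distance term between their output sentiment representations. Using the classifiers' softmax outputs as the "sentiment vector" and a squared Euclidean distance with weight `beta` are assumptions made for illustration; the paper may use a different representation or metric.

```python
# Sketch of a bi-modal SVAE objective: two independent uni-modal SVAEs plus a
# distance term that encourages their sentiment representations to agree.
# The softmax-output representation, the squared-L2 distance, and the weight
# `beta` are illustrative assumptions.
import torch.nn.functional as F

def bimodal_svae_loss(svae_a, svae_b, x_a, x_b, y=None, alpha=1.0, beta=0.1):
    """x_a, x_b: features of the same utterances in two modalities;
    y: labels (None for an unlabeled batch)."""
    if y is not None:
        loss = svae_a.labeled_loss(x_a, y, alpha) + svae_b.labeled_loss(x_b, y, alpha)
    else:
        loss = svae_a.unlabeled_loss(x_a) + svae_b.unlabeled_loss(x_b)
    # Interactive knowledge: the two modalities should predict similar
    # sentiment for the same utterance.
    s_a = F.softmax(svae_a.classifier(x_a), dim=-1)
    s_b = F.softmax(svae_b.classifier(x_b), dim=-1)
    return loss + beta * (s_a - s_b).pow(2).sum(-1).mean()
```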


Summary

INTRODUCTION

As an increasingly popular area in affective computing [5], multi-modal sentiment analysis [2], [4] focuses on generalizing text-based sentiment analysis to a multi-modal setup, where various communicative modalities, i.e., text (spoken language), vision (gestures), and audio (voice), are present. It is normally hard to obtain a sufficient amount of labeled data that integrates text, vision, and audio, since manual annotation of multi-modal data is labor-intensive and time-consuming. To address this challenge, semi-supervised learning becomes crucial to the successful application of multi-modal sentiment analysis. To the best of our knowledge, we are the first to apply semi-supervised learning to utterance-level multi-modal sentiment analysis covering the text, vision, and audio modalities. We propose a multi-modal semi-supervised variational autoencoder approach to alleviate manual annotation and improve the performance of multi-modal sentiment classification with both the independent and interactive knowledge. Our approach substantially advances the state of the art on two popular multi-modal sentiment analysis datasets, i.e., CMU-MOSI and CMU-MOSEI.

RELATED WORK
AUDIO AND VIDEO-BASED SEMI-SUPERVISED SENTIMENT CLASSIFICATION
LOW-LEVEL FEATURE EXTRACTION
EXPERIMENTAL RESULTS
Findings
VIII. CONCLUSION