Abstract

Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as the accompanying gestures and voice. In our experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.

Highlights

  • Multimodal sentiment analysis (Morency et al, 2011; Zadeh et al, 2016b; Poria et al, 2015) is an increasingly popular area of affective computing research (Poria et al, 2017) that focuses on generalizing text-based sentiment analysis to opinionated videos, where three communicative modalities are present: language, visual, and acoustic. This generalization is vital to the part of the NLP community dealing with opinion mining and sentiment analysis (Cambria et al, 2017), since there is a growing trend of sharing opinions in videos instead of text, especially on social media (Facebook, YouTube, etc.).

  • We introduce a new model, termed Tensor Fusion Network (TFN), which learns both the intra-modality and inter-modality dynamics end-to-end (a minimal sketch of the fusion step follows this list).

  • The CMU-MOSI dataset facilitates three prediction tasks, each of which we address in our experiments: 1) binary sentiment classification, 2) five-class sentiment classification, and 3) sentiment regression in the range [−3, 3].
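
As a rough illustration of the fusion step, the sketch below (not the authors' code; embedding dimensions and the function name are illustrative assumptions) shows how three unimodal embeddings, each augmented with a constant 1, can be combined by a three-way outer product so that the resulting tensor carries unimodal, bimodal, and trimodal interaction terms before being passed to downstream sentiment layers.

```python
# Minimal sketch of outer-product tensor fusion (assumed dimensions, not the released code).
import numpy as np

def tensor_fusion(z_language, z_visual, z_acoustic):
    """Flattened 3-way outer product of the modality embeddings, each augmented with a 1."""
    zl = np.append(z_language, 1.0)   # shape (d_l + 1,)
    zv = np.append(z_visual, 1.0)     # shape (d_v + 1,)
    za = np.append(z_acoustic, 1.0)   # shape (d_a + 1,)
    fused = np.einsum("i,j,k->ijk", zl, zv, za)  # shape (d_l+1, d_v+1, d_a+1)
    return fused.reshape(-1)          # fed to the sentiment inference layers

# Example with illustrative embedding sizes
fused = tensor_fusion(np.random.randn(128), np.random.randn(32), np.random.randn(32))
print(fused.shape)  # (129 * 33 * 33,) = (140481,)
```

Because of the appended 1s, the flattened tensor contains the original unimodal embeddings as sub-blocks alongside all pairwise and three-way products, which is what lets a single downstream network model intra- and inter-modality dynamics jointly.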

Summary

Introduction

Multimodal sentiment analysis (Morency et al, 2011; Zadeh et al, 2016b; Poria et al, 2015) is an increasingly popular area of affective computing research (Poria et al, 2017) that focuses on generalizing text-based sentiment analysis to opinionated videos, where three communicative modalities are present: language (spoken words), visual (gestures), and acoustic (voice). This generalization is vital to the part of the NLP community dealing with opinion mining and sentiment analysis (Cambria et al, 2017), since there is a growing trend of sharing opinions in videos instead of text, especially on social media (Facebook, YouTube, etc.). A person speaking loudly, "This movie is sick," would still be ambiguous. The complexity of inter-modality dynamics is shown in the second trimodal example, where the utterance "This movie is fair" is still weakly positive, given the strong influence of the word "fair."
