Abstract

Multimodal machine learning is a core research area spanning the language, visual, and acoustic modalities. The central challenge in multimodal learning is learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence to sequence (Seq2Seq) methods: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model. We also explore multiple variations on the multimodal inputs and outputs of these Seq2Seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines on multimodal sentiment analysis, specifically in the bimodal case, where our model improves the F1 score by twelve points. We also discuss future directions for multimodal Seq2Seq methods.
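As a rough illustration of the modality translation idea described above, the following is a minimal sketch, not the authors' released implementation, assuming PyTorch and arbitrary feature dimensions: an encoder reads one modality (e.g., text features), a decoder is trained to reconstruct a second modality (e.g., audio features), and the encoder's final hidden state is reused as the joint multimodal representation for sentiment prediction. All module and variable names below are hypothetical.

# Minimal sketch (not the paper's code) of a Seq2Seq modality translation
# model, assuming PyTorch. The encoder summarises the source modality, the
# decoder reconstructs the target modality, and the encoder's final hidden
# state serves as the joint multimodal representation.
import torch
import torch.nn as nn


class Seq2SeqModalityTranslator(nn.Module):
    def __init__(self, src_dim, tgt_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(src_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(tgt_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_dim)    # predicts target-modality features
        self.sentiment = nn.Linear(hidden_dim, 1)    # sentiment head on the representation

    def forward(self, src, tgt):
        # Encode the source modality; (h, c) summarises the whole sequence.
        _, (h, c) = self.encoder(src)
        # Decode the target modality conditioned on the encoder state
        # (teacher forcing: the ground-truth target sequence is fed in).
        dec_out, _ = self.decoder(tgt, (h, c))
        translated = self.out(dec_out)
        # The encoder's final hidden state doubles as the multimodal representation.
        representation = h[-1]
        sentiment = self.sentiment(representation).squeeze(-1)
        return translated, representation, sentiment


# Hypothetical usage with random stand-ins for text (300-d) and audio (74-d) features.
text = torch.randn(8, 20, 300)   # batch of 8 sequences, 20 timesteps each
audio = torch.randn(8, 20, 74)
model = Seq2SeqModalityTranslator(src_dim=300, tgt_dim=74)
translated, rep, pred = model(text, audio)
translation_loss = nn.functional.mse_loss(translated, audio)

In a bimodal setting such as the one highlighted in the abstract, the learned representation would then be passed to a downstream sentiment classifier or regressor; the hierarchical variant would stack such encoders over longer spans, but that is beyond this sketch.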

Highlights

  • Sentiment analysis, which involves identifying a speaker’s sentiment, is an open research problem

  • Neural network based multimodal models have been proposed that are highly effective at learning multimodal representations for multimodal sentiment analysis (Chen et al., 2017; Poria et al., 2017; Zadeh et al., 2018a,b)

  • Our results show that multimodal representations learned with our Seq2Seq modality translation method outperform the baselines on multimodal sentiment analysis


Summary

Introduction

Sentiment analysis, which involves identifying a speaker’s sentiment, is an open research problem. In this field, the majority of work has focused on unimodal methodologies, primarily textual analysis, where investigation was limited to identifying the usage of words in positive and negative scenarios. Kaushik et al. (2013) explore modalities such as audio, while Wollmer et al. (2013) explore a multimodal approach to predicting sentiment. This push toward multimodal analysis has been further bolstered by the advent of multimodal social media platforms, such as YouTube, Facebook, and VideoLectures, which are used to express personal opinions on a worldwide scale. Neural network based multimodal models have been proposed that are highly effective at learning multimodal representations for multimodal sentiment analysis (Chen et al., 2017; Poria et al., 2017; Zadeh et al., 2018a,b).

