Abstract

Multimodal sentiment analysis and emotion recognition represent a major research direction in natural language processing (NLP). With the rapid development of online media, people often express their emotions on a topic in the form of video, and the signals transmitted are multimodal, including language, visual, and audio. Traditional unimodal sentiment analysis methods are therefore no longer sufficient, and a fusion model of multimodal information is required to obtain sentiment understanding. In previous studies, multimodal features were fused by concatenating the feature vectors of each modality at every time step in the intermediate layer. This treats all modalities identically, without distinguishing strong modal information from weak modal information, and it ignores the embedding characteristics of multimodal signals across the time dimension. To address these problems, this paper proposes a new method and model for processing multimodal signals that takes into account the delay and hysteresis characteristics of multimodal signals across the time dimension, with the goal of obtaining a fused multimodal representation for sentiment analysis. We evaluate our method on the multimodal sentiment analysis benchmark dataset, the CMU Multimodal Opinion Sentiment and Emotion Intensity corpus (CMU-MOSEI), compare it with state-of-the-art models, and show excellent results.
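To make the concatenation-based fusion criticized above concrete, the following is a minimal illustrative sketch of that prior approach, not of the model proposed in this paper. The array names, number of time steps, and per-modality feature dimensions are hypothetical placeholders.

```python
import numpy as np

# Sketch of per-time-step feature concatenation (the prior fusion approach
# described in the abstract). All shapes below are assumed for illustration.
T = 20                                # number of aligned time steps (assumed)
d_text, d_vis, d_aud = 300, 35, 74    # per-modality feature sizes (assumed)

text   = np.random.randn(T, d_text)   # language features per time step
visual = np.random.randn(T, d_vis)    # visual features per time step
audio  = np.random.randn(T, d_aud)    # acoustic features per time step

# Naive fusion: concatenate the three modality vectors at each time step.
# Every modality is placed "in the same position": nothing distinguishes
# strong from weak modalities, and no cross-time-step delay is modeled.
fused = np.concatenate([text, visual, audio], axis=-1)
print(fused.shape)                    # (20, 409)
```

The proposed method differs from this baseline by distinguishing strong and weak modalities and by modeling delay across the time dimension; its exact architecture is not reproduced here.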

Highlights

  • With the development of virtual communities [1] and multimedia platforms such as YouTube and Facebook, people tend to discuss topics in videos rather than in individual texts or pictures [2]

  • Based on this research idea, this paper proposes a multimodal sequence feature extraction network that uses a new sequence fusion method and enhances the information of different modalities to study the problem of multimodal emotion recognition

  • We evaluate the proposed model on the benchmark dataset for sentiment and emotion analysis, namely, the CMU Multimodal Opinion Sentiment and Emotion Intensity corpus (CMU-MOSEI)



Introduction

With the development of virtual communities [1] and multimedia platforms such as YouTube and Facebook, people tend to discuss topics in videos rather than in individual texts or pictures [2]. They usually share their opinions, stories, and comments on these media sites in the form of videos. Multimodal sentiment analysis has become an important research field in natural language processing (NLP). It has also become basic research content for other subtasks in the NLP field, for example, video description generation [3,4], visual question answering [5,6], multimodal machine translation [7], and visual dialog [8,9].
