Abstract

This article proposes a multimedia emotion-prediction approach that uses movie scripts together with spectrograms derived from speech. First, a variety of information is extracted from the textual dialogue in scripts for emotion prediction. In addition, spectrograms transformed from speech help identify subtle cues for emotions that are difficult to predict from scripts alone. Accent also aids emotion prediction because it is an important means of expressing emotional states in speech. Together, these sources are used to analyze emotion words with similar tendencies, based on the emotion keywords found in scripts and spectrograms. Emotion candidate keywords are extracted from the text data using morphological analysis, and representative emotion keywords are then selected with Word2Vec_ARSP. Emotion keywords and the speech from the last part of each dialogue are extracted, and the speech is converted into spectrogram images. This multimedia information forms the input layer of a convolutional neural network (CNN). We propose a multi-modal method that extracts and predicts emotions more efficiently by jointly learning from integrated multimedia information: the characters’ speech and background sounds, together with dialogue that directly expresses the emotional context. To improve the accuracy of emotion prediction from multimedia information in movies, the proposed system uses a CNN for training, testing, and prediction with this multi-modal method. The multi-modal system compensates, through the spectrogram, for emotions that cannot be predicted from certain parts of the text. Prediction accuracy improves by 20.9% and 6.7% compared to using only text information and only speech information, respectively.
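As a rough illustration of the speech branch, the following is a minimal sketch of converting the tail end of a dialogue utterance into a log-mel spectrogram image. The library choice (librosa), the sampling rate, the tail length, and the function name speech_to_spectrogram are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
import librosa

def speech_to_spectrogram(wav_path, tail_seconds=3.0, sr=16000):
    """Convert the last few seconds of an utterance into a log-mel
    spectrogram image, following the abstract's idea of using speech
    from the final part of the dialogue (parameters are hypothetical)."""
    y, _ = librosa.load(wav_path, sr=sr)
    tail = y[-int(tail_seconds * sr):]                 # keep the end of the dialogue
    mel = librosa.feature.melspectrogram(y=tail, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)     # log scale, as in typical spectrogram images
    # Normalize to [0, 1] so the array can be fed to a CNN like an image.
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
```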
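Likewise, the multi-modal fusion described above might be sketched as a two-branch network: a small CNN over the spectrogram image and a dense layer over the keyword embedding, concatenated before classification. The input shapes, layer sizes, and the assumption of seven emotion classes are hypothetical; the paper's actual Word2Vec_ARSP features and architecture may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Spectrogram branch: assumed 128x128 single-channel log-mel image.
spec_in = layers.Input(shape=(128, 128, 1), name="spectrogram")
x = layers.Conv2D(32, 3, activation="relu")(spec_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Text branch: assumed 100-dim embedding of the extracted emotion keywords.
text_in = layers.Input(shape=(100,), name="keyword_embedding")
t = layers.Dense(64, activation="relu")(text_in)

# Fuse both modalities by concatenation, then classify emotions.
merged = layers.Concatenate()([x, t])
out = layers.Dense(7, activation="softmax")(merged)   # 7 emotion classes assumed

model = Model(inputs=[spec_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Concatenation is only one simple fusion choice; it lets the spectrogram branch supply evidence when the text branch alone is ambiguous, which matches the compensation behavior the abstract describes.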
