Abstract

Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied audio–video information exchange and boosting methods to regularize the training process, and we reduced computational cost by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) multimodal representations efficiently capture all the acoustic and visual emotional cues in each music video; (2) factorizing the standard 2D/3D convolution into separate channel and spatiotemporal interactions significantly reduces the computational cost of each neural network; and (3) information-sharing methods incorporated into multimodal representations help guide individual information flow and boost overall performance. We tested these findings across several unimodal and multimodal networks using various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an F1-score of 0.73, and an area under the ROC curve (AUC) of 0.926.
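To illustrate why the separable convolution strategy lowers computational cost, the sketch below compares parameter counts for a standard 3D convolution against an R(2+1)D-style factorization into a spatial (1 × k × k) convolution followed by a temporal (t × 1 × 1) convolution. The channel sizes, kernel sizes, and the `mid` bottleneck width are illustrative assumptions, not the paper's exact configuration.

```python
def conv3d_params(c_in, c_out, t, k):
    # Standard 3D convolution: one t x k x k kernel per (in, out) channel pair.
    return c_in * c_out * t * k * k

def separable_params(c_in, c_out, t, k, mid):
    # Factorized "(2+1)D" convolution (illustrative assumption):
    # a 1 x k x k spatial convolution into `mid` channels,
    # followed by a t x 1 x 1 temporal convolution to `c_out` channels.
    spatial = c_in * mid * k * k
    temporal = mid * c_out * t
    return spatial + temporal

# Example: 64 -> 64 channels, 3 x 3 x 3 receptive field.
full = conv3d_params(64, 64, t=3, k=3)                 # 110,592 weights
factored = separable_params(64, 64, t=3, k=3, mid=64)  # 49,152 weights
print(full, factored)
```

With these (hypothetical) settings the factorized block uses fewer than half the weights of the full 3D convolution while covering the same spatiotemporal receptive field.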

Highlights

  • Emotion is a psycho-physiological response triggered by the conscious or unconscious perception of external stimuli

  • Several networks are compared based on the evaluation score, complexity, and visual analysis by using a confusion matrix and a receiver operating characteristic (ROC) curve

  • The testing dataset included 300 music video samples that were never used in the training process

Summary

Introduction

Emotion is a psycho-physiological response triggered by the conscious or unconscious perception of external stimuli. A wide variety of factors is associated with emotion, including mood, physical feeling, personality, motivation, and overall quality of life. Emotions play a vital role in decision making, communication, action, and a host of cognitive processes [1]. Music videos convey affective states through verbal, visual, and acoustic cues. Because they blend multiple types of information, a number of different methods of analysis are needed to understand their contents. In the context of music videos, identifying emotional cues requires analysis not only of sound but of visual information as well, including facial expressions, gestures, and physical reactions to environmental changes (e.g., changes in color scheme, lighting, motion, and camera focus points).
