Abstract

Affective computing is an emerging area of research that aims to enable intelligent systems to recognize, feel, infer and interpret human emotions. Music videos, widely available both online and offline, are a rich source for human emotion analysis because they integrate the composer's internal feelings through song lyrics, instrumental performance, and visual expression. In general, the metadata that music video customers use to choose a product includes high-level semantics such as emotion, so automatic emotion analysis is necessary. In this research area, however, the lack of labeled datasets is a major problem. Therefore, we first construct a balanced music video emotion dataset that is diverse in territory, language, culture, and musical instruments. We test this dataset on four unimodal and four multimodal convolutional neural networks (CNNs) for music and video. First, we separately fine-tune each pre-trained unimodal CNN and test its performance on unseen data. In addition, we train a 1-dimensional CNN-based music emotion classifier with raw waveform input. A comparative analysis of each unimodal classifier over various optimizers is performed to find the best model that can be integrated into a multimodal structure. The best unimodal model is integrated with the corresponding music and video network features to form a multimodal classifier. The multimodal structure integrates the music video features as a whole and makes the final classification with a softmax classifier using a late feature fusion strategy. All possible multimodal structures are also combined into one predictive model to obtain an overall prediction. All the proposed multimodal structures use cross-validation at the decision level to overcome the data scarcity problem (overfitting). Evaluation with various metrics shows a boost in the performance of the multimodal architectures compared to each unimodal emotion classifier. The predictive model integrating all multimodal structures achieves 88.56% accuracy, an F1-score of 0.88, and an area under the curve (AUC) of 0.987. These results suggest that high-level human emotions are classified well by the proposed CNN-based multimodal networks, even though only a small number of labeled samples are available for training.
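
The following is a minimal sketch of the late-feature-fusion idea described above, written in PyTorch. The use of a pretrained ResNet-18 as the video branch, the 128-dimensional feature vectors, and the six-class emotion space are illustrative assumptions, not the authors' exact configuration; the sketch only shows how unimodal features can be concatenated and classified with a softmax layer.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class AudioBranch(nn.Module):
    """1-D CNN over the raw waveform -> fixed-length feature vector (assumed sizes)."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # global pooling over time
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        return self.proj(self.conv(wav).squeeze(-1))


class VideoBranch(nn.Module):
    """Pretrained 2-D CNN fine-tuned on video frames (transfer learning)."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone

    def forward(self, frames):                   # frames: (batch, 3, H, W)
        return self.backbone(frames)


class LateFusionClassifier(nn.Module):
    """Concatenate unimodal features and classify with a softmax layer."""
    def __init__(self, num_classes: int = 6, feat_dim: int = 128):
        super().__init__()
        self.audio = AudioBranch(feat_dim)
        self.video = VideoBranch(feat_dim)
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, wav, frames):
        fused = torch.cat([self.audio(wav), self.video(frames)], dim=1)
        # Softmax gives class probabilities; for training with
        # nn.CrossEntropyLoss one would return the raw logits instead.
        return torch.softmax(self.head(fused), dim=1)
```

In the setting described in the abstract, such a fused classifier would be trained per cross-validation fold and the fold-level predictions combined at the decision level; the sketch covers only the fusion step itself.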

Highlights

  • Music is a language that communicates emotion to anyone, even to plants or animals, and visual perception plays an increasingly crucial role in our daily lives, as it aids decision-making, learning, communication, and situation awareness in human-centric environments

  • We focus on classifying music video emotion with convolutional neural networks

  • This work covers preprocessing for network input, transfer learning, unimodal and multimodal approaches for music and video, and late fusion for deep-neural-network emotion classification (see the sketch below)
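
Below is a minimal preprocessing sketch for the two network inputs mentioned above, assuming raw-waveform audio for the 1-D music CNN and ImageNet-style frame normalization for the pretrained video CNN. The 16 kHz sample rate, the 224x224 frame size, and the file-based loading are illustrative assumptions rather than the authors' pipeline.

```python
import torch
import torchaudio
from torchvision import transforms
from PIL import Image

TARGET_SR = 16_000  # assumed sample rate for the raw-waveform branch

frame_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def load_waveform(path: str) -> torch.Tensor:
    """Load audio, mix down to mono, and resample for the 1-D CNN."""
    wav, sr = torchaudio.load(path)                     # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                 # mono: (1, samples)
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    return wav

def load_frame(path: str) -> torch.Tensor:
    """Load one video frame (saved as an image) for the pretrained 2-D CNN."""
    return frame_transform(Image.open(path).convert("RGB"))  # (3, 224, 224)
```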


Introduction

Music is a language that communicates emotion to anyone, even to plants or animals, and visual perception plays an increasingly crucial role in our daily lives, as it aids decision-making, learning, communication, and situation awareness in human-centric environments. Human emotion evoked by visual or acoustic information is vague and subjective, and depends on human thought and environmental changes. This kind of vagueness is usually reflected in music video emotion analysis. An immense number of music videos need to be classified according to these attributes, preferably in an automated way. Such a need is especially evident in online music video stores, music galleries, the digital music market, and file-sharing networks. Proper evaluation can help music managers better understand social demand and end-user taste.
