Abstract
Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied audio–video information exchange and boosting methods to regularize the training process, and we reduced the computational costs by using a separable convolution strategy. In summary, our empirical findings are as follows: (1) multimodal representations efficiently capture all acoustic and visual emotional cues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channel and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an F1-score of 0.73, and an area under the curve (AUC) score of 0.926.
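To illustrate why factorizing a convolution lowers its cost, the following minimal sketch counts the weights of a standard 3D convolution and of a depthwise-separable counterpart. It assumes the common depthwise-plus-pointwise form (one spatiotemporal filter per input channel, followed by a 1x1x1 channel-mixing step); the abstract does not specify the exact factorization used in the paper, and the function names and layer sizes here are hypothetical.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a standard 3D convolution (bias omitted)."""
    return c_in * c_out * k_t * k_h * k_w


def separable_conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of an assumed depthwise-separable 3D convolution (bias omitted)."""
    # Depthwise step: one k_t x k_h x k_w spatiotemporal filter per input channel.
    depthwise = c_in * k_t * k_h * k_w
    # Pointwise step: a 1x1x1 convolution that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3x3x3 kernel.
standard = conv3d_params(64, 128, 3, 3, 3)
separable = separable_conv3d_params(64, 128, 3, 3, 3)
print(standard, separable, round(standard / separable, 1))
```

For this hypothetical layer the standard convolution needs 221,184 weights and the separable one 9,920, roughly a 22-fold reduction. The actual savings in the paper's networks depend on their layer shapes, which the abstract does not give.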
Highlights
Emotion is a psycho-physiological response triggered by the conscious or unconscious perception of external stimuli
Several networks are compared based on the evaluation score, complexity, and visual analysis by using a confusion matrix and a receiver operating characteristic (ROC) curve
The testing dataset included 300 music video samples that were never used in the training process
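As a rough illustration of how scores such as accuracy and the F1-score can be read off a confusion matrix, the sketch below computes both from a small matrix. The matrix values and the three-class setup are invented for illustration and are not results from the paper; macro averaging is also an assumption, since the text here does not state which F1 averaging was used.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def macro_f1(cm):
    """Unweighted mean of per-class F1 scores (rows = true, columns = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Hypothetical 3-class confusion matrix (illustrative values only).
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
print(accuracy(cm), macro_f1(cm))
```

An ROC curve and its area would additionally require per-sample class scores rather than hard predictions, so they cannot be derived from a confusion matrix alone.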
Summary
Emotion is a psycho-physiological response triggered by the conscious or unconscious perception of external stimuli. A wide variety of factors are associated with emotion, including mood, physical feeling, personality, motivation, and overall quality of life. Emotions play a vital role in decision making, communication, action, and a host of cognitive processes [1]. Music videos convey affective states through verbal, visual, and acoustic cues. Because they blend multiple types of information, a number of different methods of analysis are needed to understand their contents. In the context of music videos, identifying emotional cues requires analysis not only of sound but also of visual information, including facial expressions, gestures, and physical reactions to environmental changes (e.g., changes in color scheme, lighting, motion, and camera focus points).