Abstract

Generating music with emotion similar to that of an input video is a very relevant issue nowadays. Video content creators and automatic movie directors benefit from keeping their viewers engaged, which can be facilitated by producing novel material eliciting stronger emotions in them. Moreover, there is currently a demand for more empathetic computers to aid humans in applications such as augmenting the perception ability of visually- and/or hearing-impaired people. Current approaches overlook the video’s emotional characteristics in the music generation step, only consider static images instead of videos, are unable to generate novel music, and require a high level of human effort and skills. In this study, we propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video’s emotion from its visual features and a deep Long Short-Term Memory Recurrent Neural Network to generate its corresponding audio signals with a similar emotional character. The former is able to appropriately model emotions due to its fuzzy properties, and the latter is able to model data with dynamic time properties well due to the availability of the previous hidden state information. The novelty of our proposed method lies in the extraction of visual emotional features in order to transform them into audio signals with corresponding emotional aspects for users. Quantitative experiments show low mean absolute errors of 0.217 and 0.255 on the Lindsey and DEAP datasets, respectively, and similar global features in the spectrograms. This indicates that our model is able to appropriately perform domain transformation between visual and audio features.
Based on experimental results, our model can effectively generate audio that matches the scene and elicits a similar emotion from the viewer in both datasets, and music generated by our model is also chosen more often by viewers (code available online at https://github.com/gcunhase/Emotional-Video-to-Audio-with-ANFIS-DeepRNN).
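The emotion-prediction half of the pipeline is an ANFIS, a network whose layers implement Sugeno-type fuzzy inference with learnable parameters. The following is a minimal numpy sketch of that inference step only; the Gaussian membership functions, the rule count, and the first-order linear consequents are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def gaussian_mf(x, c, s):
    """Gaussian membership degree of input x for fuzzy sets with centers c, widths s."""
    return np.exp(-((x - c) ** 2) / (2 * s ** 2))

def anfis_predict(x, centers, widths, consequents):
    """One forward pass of a first-order Sugeno fuzzy system (the scheme ANFIS learns).

    x           : (d,)    input feature vector (e.g., visual features of a scene)
    centers     : (r, d)  membership-function centers, one row per rule
    widths      : (r, d)  membership-function widths
    consequents : (r, d+1) linear consequent parameters [w..., b] per rule
    """
    # Layers 1-2: membership degrees, then rule firing strengths (product T-norm)
    mu = gaussian_mf(x, centers, widths)               # (r, d)
    w = mu.prod(axis=1)                                # (r,)
    # Layer 3: normalize firing strengths so they sum to 1
    w_bar = w / w.sum()
    # Layer 4: rule-wise linear outputs
    f = consequents[:, :-1] @ x + consequents[:, -1]   # (r,)
    # Layer 5: weighted sum -> crisp output (e.g., a continuous emotion score)
    return float(w_bar @ f)
```

With the membership and consequent parameters fit by training (as ANFIS does), the crisp output would play the role of the predicted emotion score that conditions the music-generation stage.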

Highlights

  • Generating music with emotion similar to that of an input video is a very relevant issue nowadays

  • The proposed model is a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System (ANFIS) to predict a video’s emotion from its visual features and a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to generate its corresponding audio features. The gathered audio features are used to restore the original audio waveform and compose the entire audio corresponding to a scene with similar emotional characteristics. The novelty of our proposed method lies in the extraction of visual emotional features in order to transform them into audio signals with corresponding emotional aspects for users

  • We proposed a novel hybrid deep neural network that uses an ANFIS to predict a video’s emotion from its visual features and an LSTM-RNN to generate audio features corresponding to the given visual features. The gathered audio features were used to restore the original audio waveform and compose the entire audio corresponding to a scene with similar emotional characteristics
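The highlights above credit the LSTM-RNN's fit to temporal data to the availability of the previous hidden state. A minimal numpy sketch of a single LSTM step shows where that state enters the computation; the dimensions and the framing of the input as an emotion-conditioned visual feature are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: h_prev lets each generated audio-feature frame
    be conditioned on what came before in the sequence.

    x      : (d,)  input at this step (e.g., an emotion-conditioned visual feature)
    h_prev : (n,)  previous hidden state
    c_prev : (n,)  previous cell state
    W, U   : (4n, d), (4n, n) stacked gate weights (input, forget, cell, output)
    b      : (4n,) stacked gate biases
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:n])          # input gate
    f = sigmoid(z[n:2 * n])     # forget gate
    g = np.tanh(z[2 * n:3 * n]) # candidate cell update
    o = sigmoid(z[3 * n:])      # output gate
    c = f * c_prev + i * g      # new cell state mixes memory and new input
    h = o * np.tanh(c)          # new hidden state, basis of the output frame
    return h, c
```

Iterating this step over a sequence of frames yields the audio-feature sequence from which the waveform would then be restored.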


Introduction

Generating music with emotion similar to that of an input video is a very relevant issue nowadays. We believe that further research in this area has the potential to positively affect human lives. The relationship between visual and auditory emotion can be used to augment the perception ability of visually- and/or hearing-impaired people, allowing them to perceive a field of expression they are otherwise unable to. Another example is that, with the rapid development of robotic technology in the field of engineering, humanoid robots are expected to interact effectively with humans, and the key to that is understanding a user’s emotion by endowing the machine with emotional intelligence. With these goals in mind, Affective Computing researchers have delved into emotion prediction induced by visual stimuli for various applications [7, 8]

