Abstract

Music can invest even the tritest scene with meaning. Human perceptions of music and images are closely related, as both can evoke similar sensations and emotions. Advertising agencies often lay audio and music over their visuals to engage larger audiences and to convey the emotions associated with their content more effectively; matching visuals with music that expresses comparable feelings may help viewers perceive those emotions more vividly. This paper proposes a cross-modal neural network that recommends music for a given image by matching the two modalities in a common emotional vector space. A combined image-music pair dataset has been created using valence and arousal values: the images are drawn from the OASIS dataset, while the music is queried through the Spotify API and YouTube. A transfer-learning approach with Convolutional Neural Network architectures is used to train on this dataset, employing MobileNetV3, ResNet-18, and EfficientNetB4 for the images and SampleCNN for the raw audio clips. For any given input image, a list of the top-n music recommendations is output. The approach thus aims to match music to images based on deep hidden features in the shared emotion space of the two modalities.
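The abstract describes the pipeline only at a high level. As a minimal sketch of the idea, the PyTorch snippet below shows how a pretrained image backbone (ResNet-18 here) and a raw-waveform 1-D CNN in the spirit of SampleCNN could each be given a 2-unit (valence, arousal) regression head, and how top-n recommendations could then be ranked by distance in that shared emotion space. The RawAudioCNN class, the recommend helper, and all layer sizes are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
from torchvision import models

# Image branch: a pretrained ResNet-18 whose classifier head is replaced
# by a 2-unit regression head predicting (valence, arousal).
image_net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
image_net.fc = nn.Linear(image_net.fc.in_features, 2)

# Audio branch: a stand-in for SampleCNN -- stacked 1-D convolutions over
# raw waveform samples, ending in the same 2-unit (valence, arousal) head.
# (Layer sizes are illustrative, not the paper's configuration.)
class RawAudioCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, stride=3),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(), nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128, 2)

    def forward(self, waveform):             # waveform: (batch, 1, samples)
        z = self.features(waveform).squeeze(-1)
        return self.head(z)                  # (batch, 2) -> (valence, arousal)

def recommend(image_va: torch.Tensor, music_va: torch.Tensor, n: int = 5):
    """Rank music clips by Euclidean distance between the image's predicted
    (valence, arousal) point and each clip's point; return top-n indices."""
    dists = torch.cdist(image_va.unsqueeze(0), music_va).squeeze(0)
    return torch.topk(dists, k=n, largest=False).indices
```

In this formulation, training each branch to regress the annotated valence-arousal labels is what aligns the two modalities: an image and a music clip match when their predicted emotion coordinates lie close together, so ranking by distance yields the top-n recommendations.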
