Abstract

We propose a deep learning architecture for multimodal fusion of temporal data at multiple timescales, using music and video segments extracted from Music Videos (MVs). We capture the correlations between music and video at multiple levels by learning shared feature representations with Deep Belief Networks (DBNs). The shared representations combine information from both modalities for decision-making tasks, and are used to evaluate the degree of matching between modalities and to retrieve matched modalities given single or multiple modalities as input. Moreover, we propose a novel deep architecture for handling temporal data at multiple timescales. To process long sequences of varying length, we extract hierarchical shared representations by concatenating deep representations at different levels, and perform decision fusion with a feedforward neural network that takes as input the predictions of local and global classifiers trained on the shared representations at each level. The effectiveness of our method is demonstrated on MV classification and retrieval tasks.
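
As a rough illustration of the fusion scheme described above, the sketch below shows (a) modality-specific encoders whose outputs are concatenated into a shared representation, and (b) a feedforward network that fuses the predictions of local and global classifiers. This is only a minimal sketch under stated assumptions, not the authors' implementation: it uses PyTorch, plain feedforward encoders in place of DBN/RBM-pretrained layers, and illustrative layer sizes and module names (SharedRepresentationNet, DecisionFusion).

```python
# Minimal sketch of shared-representation and decision fusion (assumptions:
# PyTorch, feedforward encoders instead of DBN pretraining, illustrative sizes).
import torch
import torch.nn as nn

class SharedRepresentationNet(nn.Module):
    """Fuse audio and video features into a shared representation."""
    def __init__(self, audio_dim=128, video_dim=256, hidden_dim=64, shared_dim=32):
        super().__init__()
        # Modality-specific encoders (stand-ins for DBN-pretrained layers).
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        # Joint layer over the concatenated modality representations.
        self.joint = nn.Sequential(nn.Linear(2 * hidden_dim, shared_dim), nn.ReLU())

    def forward(self, audio, video):
        h = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
        return self.joint(h)  # shared representation

class DecisionFusion(nn.Module):
    """Feedforward network fusing local and global classifier predictions."""
    def __init__(self, num_classes=10, num_local=4):
        super().__init__()
        in_dim = num_classes * (num_local + 1)  # local predictions + one global
        self.fuse = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, local_preds, global_pred):
        # local_preds: (batch, num_local, num_classes); global_pred: (batch, num_classes)
        x = torch.cat([local_preds.flatten(1), global_pred], dim=-1)
        return self.fuse(x)

# Usage with random stand-in features.
audio, video = torch.randn(8, 128), torch.randn(8, 256)
shared = SharedRepresentationNet()(audio, video)                       # (8, 32)
fused = DecisionFusion()(torch.randn(8, 4, 10), torch.randn(8, 10))    # (8, 10)
```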
