Abstract

Cross-media data representation, which focuses on semantic understanding of multimedia data across different modalities, is an emerging topic in web media data analysis. The most challenging issues for cross-media data representation are how to find underlying content-level correlations in the data and how to exploit such correlations in the representation model. Most traditional web media analysis works are based on single-modality data sources, such as Flickr images or YouTube videos, leaving cross-media data representation and semantic understanding wide open. In this paper, we propose a multiple kernel visual-auditory representation learning approach, which learns cross-media correlations from visual and auditory feature spaces with multiple kernel strategies. In addition, we define a cross-media distance measure for image-audio retrieval in the mutual subspace of co-occurrence. Experimental results on a collected image-audio database are encouraging and show that our approach is effective from multiple perspectives.
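The abstract does not spell out the model details, but the two core ingredients it names (a multiple-kernel combination over visual and auditory feature spaces, and a distance measure in a shared subspace) can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual method: the RBF kernels, the convex-combination weighting, and the Euclidean distance in the projected subspace are all assumptions chosen as common defaults in multiple kernel learning.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def combined_kernel(kernels, weights):
    """Convex combination of base kernels, a standard multiple-kernel scheme.

    Each base kernel could be computed on a different feature space
    (e.g., one on visual features, one on auditory features).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    return sum(wi * K for wi, K in zip(w, kernels))

def cross_media_distance(img_emb, aud_emb):
    """Euclidean distance between image and audio representations,
    assuming both have already been projected into a shared subspace."""
    return np.linalg.norm(np.asarray(img_emb) - np.asarray(aud_emb), axis=-1)
```

For image-audio retrieval, one would rank audio clips by `cross_media_distance` against a query image's projected representation; how the shared subspace itself is learned (the paper's contribution) is not reproduced here.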
