Cross-Modal Variational Auto-Encoder for Content-Based Micro-Video Background Music Recommendation

Jing Yi, Yaochen Zhu, Jiayi Xie, Zhenzhong Chen

doi:10.1109/tmm.2021.3128254

Open Access

https://doi.org/10.1109/tmm.2021.3128254

Journal: IEEE Transactions on Multimedia
Publication Date: Jan 1, 2023
Citations: 19
License type: publisher-specific, author manuscript

Affiliation: Wuhan University

Abstract

In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the embeddings of a matched video-music pair are aligned by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle, where the semantic information in the visual and textual modalities of the micro-video is weighted according to the estimated variances, such that the modality with the lower noise level is given more weight. As a result, the micro-video latent variables contain less irrelevant information, which leads to more robust generalization. Furthermore, we establish a large-scale content-based micro-video background music recommendation dataset, TT-150k, composed of approximately 3,000 different background music clips associated with 150,000 micro-videos from different users. Extensive experiments on TT-150k demonstrate the effectiveness of the proposed method. A qualitative assessment of CMVAE, based on visualizing some recommendation results, is also included.
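
The variance-weighted fusion described in the abstract can be illustrated with the standard closed form for a product of independent Gaussian experts, in which each expert contributes in proportion to its precision (inverse variance). The sketch below is only an illustration under that assumption: the abstract does not give the exact formulation used in CMVAE, and the function name `poe_fuse` and the toy values are hypothetical.

```python
import numpy as np

def poe_fuse(means, variances):
    """Fuse independent diagonal Gaussian experts N(mean_i, var_i) by their product.

    Assumption: this is the textbook product-of-Gaussians rule, not
    necessarily the exact fusion used in CMVAE.
    """
    means = np.asarray(means, dtype=float)          # shape: (n_experts, dim)
    variances = np.asarray(variances, dtype=float)  # shape: (n_experts, dim)
    precisions = 1.0 / variances                    # low variance -> high weight
    fused_var = 1.0 / precisions.sum(axis=0)
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var

# Hypothetical two-dimensional latent estimates from two modalities.
visual_mean, visual_var = np.array([1.0, 0.0]), np.array([0.1, 1.0])
textual_mean, textual_var = np.array([0.0, 2.0]), np.array([1.0, 0.1])

mu, var = poe_fuse([visual_mean, textual_mean], [visual_var, textual_var])
print(mu, var)
```

In this toy example the visual expert is more certain in the first dimension and the textual expert in the second, so the fused mean is pulled toward the visual estimate in the first dimension and toward the textual estimate in the second.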

Full Text