Abstract
Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Each modality generally exhibits its own representations and semantic information, so features encoded with a dual-tower architecture tend to lie in separate latent spaces. This makes it challenging to establish semantic relationships between modalities and results in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval consisting of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt the cross-modal mixer and mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder–decoder architecture is applied to perform a fuse-then-separate task in the pre-training phase: we feed masked fused representations into the encoder and reconstruct them with the decoder, ultimately separating the original data of the two modalities. We use the pre-trained encoder in downstream tasks to build the cross-modal retrieval method. Extensive experiments on two real-world datasets show that our approach outperforms previous state-of-the-art methods in video–audio matching tasks, improving retrieval accuracy by up to 2×. Furthermore, we demonstrate the generality of our model by transferring it to other downstream tasks as a universal model.