Cross-modal retrieval aims to elucidate information fusion, imitate human learning, and advance the field. Previous reviews have primarily focused on binary and real-value coding methods, and few cover techniques grounded in deep representation learning. In this paper, we concentrate on harmonizing cross-modal representation learning with the full-cycle modeling of high-level semantic associations between vision and language, diverging from traditional statistical methods. We systematically categorize and summarize the challenges and open issues in implementing current technologies and investigate the pipeline of cross-modal retrieval, including pre-processing, feature engineering, pre-training tasks, encoding, cross-modal interaction, decoding, model optimization, and a unified architecture. Furthermore, we propose benchmark datasets and evaluation metrics to assist researchers in keeping pace with cross-modal retrieval advancements. By incorporating recent innovative works, we offer a perspective on potential advancements in cross-modal retrieval.