Cross-modal retrieval, which searches across different modalities (such as text, image, video, audio, and 3-D model), has drawn wide interest. However, existing methods based on deep neural networks often face the challenge of insufficient cross-modal training data, which limits training effectiveness and easily leads to overfitting. Transfer learning is usually adopted to relieve the problem of insufficient training data, but it mainly focuses on transferring knowledge from a large-scale single-modal source domain (such as ImageNet) to a single-modal target domain. In fact, such large-scale single-modal datasets also contain rich modal-independent semantic knowledge that can be shared across different modalities. Moreover, large-scale cross-modal datasets are very labor-intensive to collect and label, so it is important to fully exploit the knowledge in single-modal datasets to boost cross-modal retrieval. To achieve this goal, this paper proposes a modal-adversarial hybrid transfer network (MHTN), which aims to realize knowledge transfer from a single-modal source domain to a cross-modal target domain and learn a cross-modal common representation. It is an end-to-end architecture with two subnetworks. First, a modal-sharing knowledge transfer subnetwork is proposed to jointly transfer knowledge from the single modality in the source domain to all modalities in the target domain with a star network structure, which distills modal-independent supplementary knowledge to promote cross-modal common representation learning. Second, a modal-adversarial semantic learning subnetwork is proposed to construct an adversarial training mechanism between a common representation generator and a modality discriminator, making the common representation discriminative for semantics but indiscriminative for modalities, which enhances cross-modal semantic consistency during the transfer process. Comprehensive experiments on four widely used datasets show the effectiveness of MHTN.
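To illustrate the modal-adversarial idea described in the abstract, the following is a minimal PyTorch sketch of adversarial training between a common-representation generator and a modality discriminator via a gradient reversal layer. It is not the authors' implementation; the two-modality setup, layer sizes, and all names (e.g., ModalAdversarialNet, img_enc, txt_enc) are illustrative assumptions.

```python
# Illustrative sketch only: a common-representation generator kept discriminative for
# semantics by a semantic classifier, while a modality discriminator (through gradient
# reversal) pushes the representation to be indiscriminative for modalities.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class ModalAdversarialNet(nn.Module):
    # Hypothetical dimensions: 4096-d image features, 300-d text features.
    def __init__(self, img_dim=4096, txt_dim=300, common_dim=512,
                 num_classes=10, num_modalities=2):
        super().__init__()
        # Modality-specific encoders mapping into a shared common space.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, common_dim), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.ReLU())
        # Semantic classifier: keeps the common representation discriminative for semantics.
        self.sem_clf = nn.Linear(common_dim, num_classes)
        # Modality discriminator: adversary that tries to identify the source modality.
        self.mod_disc = nn.Linear(common_dim, num_modalities)

    def forward(self, img, txt, lamb=1.0):
        # Common representations for both modalities, stacked into one batch.
        common = torch.cat([self.img_enc(img), self.txt_enc(txt)], dim=0)
        sem_logits = self.sem_clf(common)
        # Gradient reversal trains the encoders to fool the modality discriminator.
        mod_logits = self.mod_disc(GradReverse.apply(common, lamb))
        return sem_logits, mod_logits

# Usage sketch on a toy batch: semantic loss plus adversarial modality loss.
model = ModalAdversarialNet()
img, txt = torch.randn(8, 4096), torch.randn(8, 300)
labels = torch.randint(0, 10, (16,))                              # semantic labels for both modalities
mod_labels = torch.cat([torch.zeros(8), torch.ones(8)]).long()    # 0 = image, 1 = text
sem_logits, mod_logits = model(img, txt)
loss = nn.CrossEntropyLoss()(sem_logits, labels) + nn.CrossEntropyLoss()(mod_logits, mod_labels)
loss.backward()
```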