Zero-shot cross-modal hashing has recently attracted significant attention because it enables the retrieval of emerging concepts in multimedia data. Although existing approaches have shown impressive results, the following limitations remain unaddressed: (1) labels in large-scale real-world multimodal datasets are often incomplete or partially missing; (2) existing methods ignore the influence of feature-wise low-level similarity and label distribution on retrieval performance; and (3) the limited representation ability of dense hash codes constrains their discriminative potential. To address these issues, we introduce an effective cross-modal retrieval framework called two-stage zero-shot sparse hashing with missing labels (TZSHML). Specifically, we learn a classifier from the partially labeled samples to predict the labels of unlabeled data, and then use the reliable information in the correctly annotated labels to recover the missing labels. The predicted and recovered labels are combined to obtain more accurate labels for samples with missing labels. In addition, we employ sample-wise fine-grained similarity and cluster-wise similarity to learn hash codes, ensuring that more semantically similar samples are clustered together. Moreover, we adopt high-dimensional sparse hash codes to capture richer semantic information. Finally, drift and interaction terms are introduced into the hash-function learning to further narrow the gap between modalities. Extensive experiments demonstrate that our approach is competitive with state-of-the-art methods in zero-shot retrieval scenarios with missing labels. The source code is available at https://github.com/szq0816/TZSHML.
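As a concrete illustration of the label-completion stage sketched in the abstract, the following Python snippet shows one plausible way to combine classifier-predicted labels with correlation-based recovered labels. It is a minimal sketch under our own assumptions: the function name `complete_labels`, the choice of logistic regression as the per-class classifier, the co-occurrence recovery rule, and the weighting parameter `alpha` are illustrative and do not reflect the paper's actual formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def complete_labels(X, Y_obs, mask, alpha=0.5):
    """Fill missing entries of a 0/1 multi-label matrix.

    X     : (n, d) feature matrix
    Y_obs : (n, c) label matrix; entries where mask is False are ignored
    mask  : (n, c) boolean, True where the label is observed
    alpha : weight on classifier predictions vs. co-occurrence recovery
    """
    n, c = Y_obs.shape
    Y_pred = np.where(mask, Y_obs, 0.5).astype(float)  # neutral prior for missing entries

    # Step 1: per-class classifiers trained on observed entries predict the
    # missing ones ("learn a classifier through the partially known labels").
    for j in range(c):
        obs = mask[:, j]
        if not (~obs).any() or len(np.unique(Y_obs[obs, j])) < 2:
            continue  # nothing to predict, or too few observed labels to fit
        clf = LogisticRegression(max_iter=1000).fit(X[obs], Y_obs[obs, j])
        Y_pred[~obs, j] = clf.predict_proba(X[~obs])[:, 1]

    # Step 2: recover missing entries from label co-occurrence estimated on
    # the reliably observed labels ("use the reliable information in the
    # correctly marked labels to recover the missing labels").
    Y_safe = np.where(mask, Y_obs, 0).astype(float)
    cooc = Y_safe.T @ Y_safe
    cond = cooc / np.maximum(cooc.diagonal()[:, None], 1.0)  # cond[j, k] ~ P(k | j)
    support = np.maximum(Y_safe.sum(axis=1, keepdims=True), 1.0)
    Y_rec = np.clip(Y_safe @ cond / support, 0.0, 1.0)

    # Combine both estimates and binarize; observed entries are kept as-is.
    Y_hat = alpha * Y_pred + (1.0 - alpha) * Y_rec
    return np.where(mask, Y_obs.astype(float), (Y_hat > 0.5).astype(float))


# Toy usage: correlated multi-label data with 40% of label entries hidden.
rng = np.random.default_rng(0)
n, d, c = 300, 16, 6
X = rng.standard_normal((n, d))
Y = (X @ rng.standard_normal((d, c)) > 0.5).astype(int)
mask = rng.random((n, c)) > 0.4
Y_full = complete_labels(X, Y, mask)
print(f"accuracy on hidden entries: {(Y_full[~mask] == Y[~mask]).mean():.3f}")
```

In the actual TZSHML framework the completed labels would then supervise the subsequent sparse hash-code learning; the thresholded combination above merely illustrates the two-stage idea.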