Abstract

With the advent of the big data era, multimedia data is growing rapidly and its modalities are becoming increasingly diverse. The demand for fast and accurate cross-modal information retrieval is therefore increasing. Hashing-based cross-modal retrieval has attracted widespread attention because it encodes multimedia data into a common binary hash space, thereby efficiently measuring the correlation between samples from different modalities. In this paper, we propose a novel end-to-end deep cross-modal retrieval framework, namely Clustering-driven Deep Adversarial Hashing (CDAH), which has three main characteristics. Firstly, CDAH learns discriminative clusters recursively through a soft clustering model. It attempts to generate modality-invariant representations in a common space by confusing the modality classifier, which tries to distinguish the different modalities from the generated representations. Secondly, in order to minimize the modality gap between feature representations from different modalities with the same semantic label, and to maximize the distance between images and texts with different labels, CDAH constructs a fused-semantics matrix that integrates the original domain information from different modalities and serves as self-supervised information to refine the binary codes. Finally, CDAH employs a scaled tanh function to learn the binary codes adaptively, so that the relaxed solution gradually converges to the original, intractable discrete binary coding problem. We conduct comprehensive experiments on four popular datasets, and the experimental results demonstrate the superiority of our model over state-of-the-art methods.
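To make the last point concrete, the snippet below is a minimal sketch (not the authors' implementation) of how a scaled tanh relaxation of the sign function behaves: as the scale factor beta is increased during training, tanh(beta * h) approaches sign(h), so the relaxed codes converge to true binary codes. The function name, variable names, and values are illustrative assumptions.

    import numpy as np

    def relaxed_codes(h, beta):
        # Continuous surrogate for binary hash codes: tanh(beta * h)
        # approaches sign(h) as beta grows, avoiding the non-differentiable
        # sign function during training.
        return np.tanh(beta * h)

    h = np.array([-0.8, -0.05, 0.3, 1.2])   # hypothetical real-valued hash outputs
    for beta in [1.0, 5.0, 50.0]:            # beta is gradually increased over training
        print(beta, relaxed_codes(h, beta))
    # As beta grows, the output approaches sign(h) = [-1, -1, 1, 1].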
