Unsupervised cross-modal hashing has achieved great success in various information retrieval applications owing to its efficient storage usage and fast retrieval speed. Recent studies have primarily focused on training hash encoding networks by computing a sample-based similarity matrix to improve retrieval performance. However, two issues remain unsolved: (1) the current sample-based similarity matrix only considers the similarity between image-text pairs and ignores the different information densities of the two modalities, which may introduce additional noise and fail to mine the key information for retrieval; (2) most existing unsupervised cross-modal hashing methods only consider alignment between different modalities while ignoring consistency within each modality, resulting in semantic conflicts. To tackle these challenges, a novel Deep High-level Concept-mining Jointing Hashing (DHCJH) model for unsupervised cross-modal retrieval is proposed in this study. DHCJH captures the essential high-level semantic information from the image modality and integrates it into the text modality to improve the accuracy of the guidance information. Additionally, a new hashing loss with a regularization term is introduced to avoid cross-modal semantic collisions and false-positive pairs. To validate the proposed method, extensive comparison experiments are conducted on benchmark datasets. The experimental findings reveal that DHCJH achieves superior performance in both accuracy and efficiency. The code of DHCJH is available on GitHub.
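For context, a minimal sketch of how a sample-based similarity matrix is commonly constructed in unsupervised cross-modal hashing, not DHCJH's exact formulation: cosine similarities are computed within each modality and then fused. The feature dimensions and the fusion weight alpha below are illustrative assumptions.

import numpy as np

def cosine_similarity_matrix(features):
    # Row-normalize the features and return the pairwise cosine-similarity matrix.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T

# Toy batch: 4 image-text pairs with hypothetical 512-d image and 300-d text features.
rng = np.random.default_rng(0)
image_features = rng.standard_normal((4, 512))
text_features = rng.standard_normal((4, 300))

S_img = cosine_similarity_matrix(image_features)  # intra-image similarities
S_txt = cosine_similarity_matrix(text_features)   # intra-text similarities

# A simple fused similarity matrix; alpha is a hypothetical weighting hyperparameter.
alpha = 0.5
S_joint = alpha * S_img + (1 - alpha) * S_txt
print(S_joint.shape)  # (4, 4)

Such a fused matrix typically serves as the similarity target that guides hash-code learning; DHCJH's contribution, per the abstract, lies in mining high-level concepts from the image modality to refine this guidance rather than in the fusion step itself.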