Abstract

Owing to their low storage cost and fast retrieval speed, deep hashing methods are widely used in cross-modal retrieval. Since images are usually accompanied by text descriptions rather than labels, unsupervised methods have attracted considerable attention. However, because of the modality gap and semantic differences, existing unsupervised methods cannot adequately bridge the modalities, leading to suboptimal retrieval results. In this paper, we propose CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH), which aims to exploit the semantic link between the original features of each modality and the reconstructed features. First, we design a modal cyclic interaction method that aligns semantics within each modality: one modality's features reconstruct the other's, so that both intramodal and intermodal semantic similarity are fully taken into account. Second, we introduce graph attention networks (GAT) into the cross-modal retrieval task, modelling the influence of neighbouring text nodes with an attention mechanism to capture the global features of the text modality. Third, we extract fine-grained image features with the CLIP visual encoder. Finally, hash codes are learned through hash functions. Experiments on three widely used datasets demonstrate that our proposed CCAH achieves satisfactory overall retrieval accuracy. Our code can be found at: https://github.com/CQYIO/CCAH.git.
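The final step the abstract describes — learning binary hash codes and retrieving by them — can be illustrated with a minimal sketch. This is not the authors' implementation; the feature vectors, the sign-based binarization, and the Hamming-distance ranking are generic assumptions about how hashing-based retrieval works, not details taken from the paper.

```python
# Hedged sketch of hashing-based retrieval: continuous embeddings
# (e.g. encoder outputs) are binarized into hash codes, and retrieval
# ranks database items by Hamming distance to the query code.
# All feature values below are made up for illustration.

def to_hash_code(features):
    """Binarize a continuous feature vector with the sign function."""
    return [1 if x >= 0 else 0 for x in features]

def hamming_distance(a, b):
    """Count the positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy continuous embeddings, e.g. one query image and two database texts.
query_feat = [0.8, -0.3, 0.5, -0.9]
db_feats = [
    [0.7, -0.1, 0.4, -0.8],   # semantically close to the query
    [-0.6, 0.9, -0.2, 0.3],   # dissimilar
]

query_code = to_hash_code(query_feat)
db_codes = [to_hash_code(f) for f in db_feats]
dists = [hamming_distance(query_code, c) for c in db_codes]
ranking = sorted(range(len(dists)), key=lambda i: dists[i])
print(ranking[0])  # index of the nearest database item -> 0
```

Binary codes make this ranking cheap: Hamming distance is a bitwise XOR plus a popcount, which is what gives hashing methods their low storage cost and fast retrieval speed.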
