Unsupervised image hashing is a widely used technique for large-scale image retrieval. It maps an image to a fixed-length binary code without requiring extensive human-annotated data, enabling compact storage and effective semantic retrieval. This study proposes a novel deep unsupervised double-bit hashing method for image retrieval. The approach builds on double-bit hashing, which has been shown to preserve the neighborhood structure of the data in the binary codes better than single-bit hashing. Traditional double-bit hashing methods must process the entire dataset at once to determine the optimal thresholds for binary feature encoding. In contrast, the proposed method trains the hashing layer in a minibatch manner, learning the thresholds adaptively through a gradient-based optimization strategy. Additionally, whereas most previous methods train only the hashing network on top of a fixed, pre-trained backbone network, the proposed learning framework trains both the hashing and backbone networks alternately and asynchronously. This strategy allows the model to make full use of the learning capacity of both networks. Furthermore, adopting a lightweight Vision Transformer (ViT) allows the model to capture both local and global relationships among multiple views of an image exemplar, which leads to better generalization and thus improves retrieval performance. Extensive experiments on the CIFAR10, NUS-WIDE, and FLICKR25K datasets show that the proposed method achieves better retrieval quality and computational efficiency than state-of-the-art methods.
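The abstract does not give the exact formulation, so the following is only an illustrative sketch of the general idea behind double-bit quantization. Two thresholds a < b per projected dimension split that dimension into three regions, which are coded with two bits so that neighboring regions differ by one bit and the two outer regions differ by two. A sigmoid relaxation of the same code is included to show how, in an autodiff framework, a and b could be treated as parameters and updated by gradient descent on minibatches. The function names, the bit order, and the `temperature` parameter are hypothetical choices for illustration, not the authors' method.

```python
import numpy as np

def double_bit_encode(x, a, b):
    """Hard double-bit code for projections x, assuming thresholds a < b per dimension.

    Each dimension yields the bit pair (x > b, x < a):
      x < a       -> 01
      a <= x <= b -> 00
      x > b       -> 10
    Neighboring regions differ by one bit; the two outer regions differ by two.
    """
    x = np.asarray(x, dtype=float)
    hi = (x > b).astype(np.uint8)
    lo = (x < a).astype(np.uint8)
    # Interleave so each input dimension contributes two adjacent bits.
    return np.stack([hi, lo], axis=-1).reshape(*x.shape[:-1], -1)

def _sigmoid(z):
    # tanh form avoids overflow warnings for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def soft_double_bit(x, a, b, temperature=0.1):
    """Hypothetical sigmoid relaxation of double_bit_encode.

    Smooth in x, a, and b, so gradients could reach the thresholds when this
    is written in an autodiff framework; approaches the hard code as
    temperature -> 0.
    """
    x = np.asarray(x, dtype=float)
    hi = _sigmoid((x - b) / temperature)
    lo = _sigmoid((a - x) / temperature)
    return np.stack([hi, lo], axis=-1).reshape(*x.shape[:-1], -1)

def hamming(u, v):
    """Number of differing bits between two codes."""
    return int(np.count_nonzero(np.asarray(u) != np.asarray(v)))
```

For example, with a single dimension, `a = np.array([-0.5])`, and `b = np.array([0.5])`, the inputs -1.0, 0.0, and 1.0 encode to 01, 00, and 10, so the two extreme inputs are at Hamming distance 2 while each is at distance 1 from the middle one. A single-bit code with one threshold would instead place 0.0 on the same side as one of the extremes.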