Transformer-Based Clipped Contrastive Quantization Learning For Unsupervised Image Retrieval

Abstract

Unsupervised image retrieval aims to learn the important visual characteristics, without any given labels, to retrieve images similar to a given query image. Convolutional Neural Network (CNN)-based approaches have been extensively exploited with self-supervised contrastive learning for image hashing. However, the existing approaches suffer from the ineffective utilization of global features by CNNs and the bias created by false negative pairs in contrastive learning. In this paper, we propose a TransClippedCLR model that encodes the global context of an image using a Transformer with local context through patch-based processing, generates hash codes through product quantization, and avoids potential false negative pairs through clipped contrastive learning. The proposed model shows superior performance for unsupervised image retrieval on benchmark datasets, including CIFAR10, NUS-WIDE and Flickr25K, as compared to recent state-of-the-art deep models. The results using the proposed clipped contrastive learning are greatly improved on all datasets as compared to the same backbone network with vanilla contrastive learning.
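The abstract does not spell out the clipping rule, but the idea of removing likely false negatives from a contrastive objective can be sketched as follows. Treating the `clip_k` most similar negatives per anchor as presumed false negatives is an assumption of this sketch, not the paper's exact formulation; the loss shape is NT-Xent-style.

```python
import numpy as np

def clipped_contrastive_loss(z, tau=0.5, clip_k=2):
    """NT-Xent-style loss where, for each anchor, the clip_k most similar
    negatives are discarded as likely false negatives (the 'clipping').
    z: (2N, d) embeddings; rows i and i+N form a positive pair.
    NOTE: the exact clipping rule of TransClippedCLR is not given in the
    abstract; dropping the top-k hardest negatives is an assumption here."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    losses = []
    for i in range(n2):
        j = (i + n) % n2                               # positive partner index
        neg = np.delete(sim[i], [i, j])                # all negatives for anchor i
        if clip_k > 0:
            neg = np.sort(neg)[:-clip_k]               # clip hardest negatives
        logits = np.concatenate(([sim[i, j]], neg))
        # -log( exp(pos) / sum(exp(logits)) )
        losses.append(-sim[i, j] + np.log(np.sum(np.exp(logits))))
    return float(np.mean(losses))
```

Because clipping removes the largest terms from the denominator, the clipped loss is never larger than the vanilla one on the same embeddings, which matches the intent of discounting suspected false negatives.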

Similar Papers
  • Conference Article
  • Cited by 22
  • 10.1145/3394171.3414537
PyRetri: A PyTorch-based Library for Unsupervised Image Retrieval by Deep Convolutional Neural Networks
  • Oct 12, 2020
  • Benyi Hu + 5 more

Despite significant progress of applying deep learning methods to the field of content-based image retrieval, there has not been a software library that covers these methods in a unified manner. In order to fill this gap, we introduce PyRetri, an open source library for deep learning based unsupervised image retrieval. The library encapsulates the retrieval process in several stages and provides functionality that covers various prominent methods for each stage. The idea underlying its design is to provide a unified platform for deep learning based image retrieval research, with high usability and extensibility. The project source code, together with usage examples, sample data and pre-trained models, is available at https://github.com/PyRetri/.

  • Conference Article
  • 10.1109/icip.2019.8802994
Unsupervised Image Retrieval With Mask-Based Prominent Feature Accumulation
  • Sep 1, 2019
  • Xinyi Wang + 3 more

For unsupervised image retrieval, the features chosen for the final representation determine its performance. Nowadays, unsupervised methods can handle most image retrieval tasks; however, retrieving images with complex backgrounds remains challenging. In this paper, we propose a new approach, mask-based prominent feature accumulation (MPFA), which utilizes a MAX-Mask and a SUM-Mask to retain significant features in each channel for all database images. Channels of the feature maps extracted from a pretrained CNN are then sorted by MPFA to select representative channels. After that, the final image representation is generated by aggregating the selected channels. Experiments on public datasets show the improvement of our proposed approach over state-of-the-art methods, especially for images with complex backgrounds.
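The abstract gives only the outline of the MAX-Mask/SUM-Mask channel selection; the following sketch illustrates the select-then-aggregate idea with assumed per-channel scores (the real MPFA masks are spatial and more involved than this).

```python
import numpy as np

def mpfa_representation(fmap, k=8):
    """Channel-selection sketch: score each channel by combining its spatial
    maximum (MAX) and total activation (SUM), keep the top-k channels, and
    aggregate them into the final descriptor. The scoring rule here is an
    illustrative assumption, not the paper's exact mask construction."""
    # fmap: (C, H, W) feature maps from a pretrained CNN
    scores = fmap.max(axis=(1, 2)) * fmap.sum(axis=(1, 2))
    top = np.argsort(scores)[::-1][:k]       # k most representative channels
    desc = fmap[top].sum(axis=(1, 2))        # sum-pool the selected channels
    return desc / (np.linalg.norm(desc) + 1e-12)
```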

  • Conference Article
  • Cited by 1
  • 10.1109/icpr.2010.1062
Unsupervised Image Retrieval with Similar Lighting Conditions
  • Aug 1, 2010
  • J Felix Serrano + 4 more

In this work a new method to retrieve images with similar lighting conditions is presented. It is based on automatic clustering and automatic indexing, and belongs to the Content-Based Image Retrieval (CBIR) category. The goal is to retrieve images from a database, by their content, with similar lighting conditions. When we look at images taken of outdoor scenes, much of the perceived information depends on the lighting conditions. The proposal combines fixed and randomly extracted points for feature extraction. The describing features are the mean, the standard deviation and the homogeneity (from the co-occurrence matrix) of a sub-image extracted from the three color channels (H, S, I). A K-MEANS algorithm and a 1-NN classifier are used to build an indexed database of 300 images in order to retrieve images with similar lighting conditions in sky regions: sunny, partially cloudy and completely cloudy. One advantage of the proposal is that images do not need to be manually labeled for retrieval. The performance of our framework is demonstrated through several experimental results, including improved rates for retrieving images with similar lighting conditions. A comparison with another similar work is also presented.
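A minimal sketch of the feature-plus-nearest-neighbour part of this pipeline follows, using only the per-channel mean and standard deviation; the co-occurrence homogeneity feature and the K-MEANS indexing step are omitted for brevity.

```python
import numpy as np

def lighting_features(img_hsi):
    """Per-channel mean and std of an HSI sub-image (the paper also uses
    co-occurrence homogeneity, omitted in this sketch)."""
    # img_hsi: (h, w, 3) array with H, S, I channels
    return np.concatenate([img_hsi.mean(axis=(0, 1)),
                           img_hsi.std(axis=(0, 1))])

def retrieve_similar_lighting(query, database, k=3):
    """1-NN-style ranking: indices of the k database images whose lighting
    features are closest (Euclidean) to the query's."""
    q = lighting_features(query)
    d = np.array([lighting_features(im) for im in database])
    order = np.argsort(np.linalg.norm(d - q, axis=1))
    return order[:k]
```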

  • Research Article
  • 10.1175/aies-d-25-0003.1
WV-Net: A Foundation Model for SAR Ocean Satellite Imagery
  • Oct 1, 2025
  • Artificial Intelligence for the Earth Systems
  • Yannik Glaser + 7 more

The European Space Agency’s Sentinel-1 (S-1) satellite mission has captured more than 10 million images of the ocean surface using C-band synthetic aperture radar (SAR WV mode). While machine learning is a promising approach for detecting and quantifying various geophysical signatures in these images, scientists are limited by the cost of manual data annotation for any particular task. We propose to use contrastive self-supervised learning on the full archive of unannotated WV-mode images to train a semantic embedding model named WV-Net. In experiments, we show that WV-Net embeddings outperform those from models that were pretrained with natural images (ImageNet) on four downstream tasks: multilabel classification [0.96 average area under the receiver operating characteristic (AUROC) vs 0.95], wave height regression [0.50 root-mean-square error (RMSE) vs 0.60], near-surface air temperature regression (0.90 RMSE vs 0.97), and unsupervised image retrieval [0.41 class-averaged mean average precision (mAP) vs 0.37]. WV-Net embeddings also scale better in data-sparse settings, and fine-tuned WV-Net models are more robust to hyperparameter choices. The WV-Net foundation model is publicly available and can be adapted to a variety of data analysis and exploration tasks in geophysical research. Significance Statement Satellite imagery of the ocean from the Sentinel-1 (S-1) mission contains global and temporal information about ocean waves, atmospheric patterns, biological signatures, sea ice, and ships. However, the dataset is currently not fully utilized due to the size and complexity of the data. Our WV-Net foundation model makes this large dataset more accessible to researchers by providing a pretrained model that converts images into semantic vector embeddings, making the dataset amenable to analysis using advanced data-driven methods.

  • Research Article
  • Cited by 11
  • 10.1109/tip.2023.3268868
Rank Flow Embedding for Unsupervised and Semi-Supervised Manifold Learning.
  • Jan 1, 2023
  • IEEE Transactions on Image Processing
  • Lucas Pascotti Valem + 2 more

Impressive advances in acquisition and sharing technologies have made the growth of multimedia collections and their applications almost unlimited. However, the opposite is true for the availability of labeled data, which is needed for supervised training, since such data is often expensive and time-consuming to obtain. While there is a pressing need for the development of effective retrieval and classification methods, the difficulties faced by supervised approaches highlight the relevance of methods capable of operating with few or no labeled data. In this work, we propose a novel manifold learning algorithm named Rank Flow Embedding (RFE) for unsupervised and semi-supervised scenarios. The proposed method is based on ideas recently exploited by manifold learning approaches, which include hypergraphs, Cartesian products, and connected components. The algorithm computes context-sensitive embeddings, which are refined following a rank-based processing flow, while complementary contextual information is incorporated. The generated embeddings can be exploited for more effective unsupervised retrieval or semi-supervised classification based on Graph Convolutional Networks. Experiments were conducted on 10 different collections, considering various features, including ones obtained with recent Convolutional Neural Network (CNN) and Vision Transformer (ViT) models. The results demonstrate the effectiveness of the proposed method on different tasks, unsupervised image retrieval, semi-supervised classification, and person Re-ID, and show that RFE is competitive with or superior to the state of the art in the diverse evaluated scenarios.

  • Dissertation
  • Cited by 2
  • 10.18297/etd/662
Image annotation and retrieval based on multi-modal feature clustering and similarity propagation.
  • Feb 12, 2015
  • Mohamed Ismail

The performance of content-based image retrieval systems has proved to be inherently constrained by the low-level features used, and cannot give satisfactory results when the user's high-level concepts cannot be expressed by low-level features. In an attempt to bridge this semantic gap, recent approaches started integrating both low-level visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be possible for large image databases. In this thesis we propose a system for image retrieval that has three main components. The first component of our system consists of a novel possibilistic clustering and feature weighting algorithm based on robust modeling of the Generalized Dirichlet (GD) finite mixture. Robust estimation of the mixture model parameters is achieved by incorporating two complementary types of membership degrees. The first one is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second membership represents the degree of typicality and is used to identify and discard noise points. Robustness to noisy and irrelevant features is achieved by transforming the data to make the features independent and follow a Beta distribution, and by learning an optimal relevance weight for each feature subset within each cluster. We extend our algorithm to find the optimal number of clusters in an unsupervised and efficient way by exploiting some properties of the possibilistic membership function. We also outline a semi-supervised version of the proposed algorithm. The second component of our system consists of a novel approach to unsupervised image annotation. Our approach is based on: (i) the proposed semi-supervised possibilistic clustering; (ii) a greedy selection and joining algorithm (GSJ); (iii) Bayes rule; and (iv) a probabilistic model that is based on possibilistic membership degrees to annotate an image.
The third component of the proposed system consists of an image retrieval framework based on multi-modal similarity propagation. The proposed framework is designed to deal with two data modalities: low-level visual features and high-level textual keywords generated by our proposed image annotation algorithm. The multi-modal similarity propagation system exploits the mutual reinforcement of relational data and results in a nonlinear combination of the different modalities. Specifically, it is used to learn the semantic similarities between images by leveraging the relationships between features from the different modalities. The proposed image annotation and retrieval approaches are implemented and tested with a standard benchmark dataset. We show the effectiveness of our clustering algorithm to handle high dimensional and noisy data. We compare our proposed image annotation approach to three state-of-the-art methods and demonstrate the effectiveness of the proposed image retrieval system.

  • Conference Article
  • Cited by 2
  • 10.1109/cac.2018.8623769
Stacked Denoising Auto-encoder Based Image Representation for Visual Loop Closure Detection
  • Nov 1, 2018
  • Baoyang Ding + 4 more

Loop closure detection is important in SLAM (Simultaneous Localization and Mapping) for its capability of relocation. Many techniques have been proposed, such as Kalman-filtering-based methods. On the other hand, loop closure in visual SLAM can also be treated as an image retrieval problem. In recent years, deep learning has received great attention and is very appropriate for image classification and retrieval. However, deep learning usually asks for big data, which may not be available in visual SLAM. In this paper, we propose an unsupervised image retrieval method for loop closure detection. The SDA (stacked denoising auto-encoder) is employed to translate images into high-dimensional representations, and then loop closure detection is performed. The experiments show that our method outperforms the traditional BoW (Bag-of-Words) method on the 'New College' and 'City Centre' datasets.

  • Research Article
  • 10.54254/2755-2721/38/20230532
Research on unsupervised image retrieval methods based on contrastive learning
  • Feb 7, 2024
  • Applied and Computational Engineering
  • Hanhong Liu

In the convergence of fashion and artificial intelligence (AI), significant strides have been made in areas such as clothing recognition, retrieval, and classification, enabled by advanced AI technologies and expansive annotated datasets. As the AI in Fashion market continues to surge, the future of the fashion industry promises to be redefined by intelligent, efficient, and more accessible solutions. Image retrieval, one of the important parts of AI, has experienced remarkable growth, empowered by advanced algorithms and vast annotated datasets, making it a crucial component in various domains such as digital libraries and online marketing. Therefore, this report provides an extensive review of image retrieval methods and the emerging paradigm of contrastive learning, underscoring their relevance and applications in the realm of artificial intelligence. This paper primarily reviews the technologies at the intersection of image retrieval and contrastive learning. It elucidates the history and progression of image retrieval, offers a methodical analysis of the two primary approaches, text-based image retrieval and content-based image retrieval, and examines how contrastive learning is employed in image retrieval systems.

  • Research Article
  • Cited by 2
  • 10.1109/tnnls.2024.3363163
Locating Target Regions for Image Retrieval in an Unsupervised Manner.
  • Mar 1, 2025
  • IEEE Transactions on Neural Networks and Learning Systems
  • Bo-Jian Zhang + 3 more

Image retrieval performance can be improved by training a convolutional neural network (CNN) model with annotated data to facilitate accurate localization of target regions. However, obtaining sufficiently annotated data is expensive and impractical in real settings. It is challenging to achieve accurate localization of target regions in an unsupervised manner. To address this problem, we propose a new unsupervised image retrieval method named unsupervised target region localization (UTRL) descriptors. It can precisely locate target regions without supervisory information or learning. Our method contains three highlights: 1) we propose a novel zero-label transfer learning method to address the problem of co-localization in target regions. This enhances the potential localization ability of pretrained CNN models through a zero-label data-driven approach; 2) we propose a multiscale attention accumulation method to accurately extract distinguishable target features. It distinguishes the importance of features by using local Gaussian weights; and 3) we propose a simple yet effective method to reduce vector dimensionality, named twice-PCA-whitening (TPW), which reduces the performance degradation caused by feature compression. Notably, TPW is a robust and general method that can be widely applied to image retrieval tasks to improve retrieval performance. This work also facilitates the development of image retrieval based on short vector features. Extensive experiments on six popular benchmark datasets demonstrate that our method achieves about 7% greater mean average precision (mAP) compared to existing state-of-the-art unsupervised methods.
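The abstract describes twice-PCA-whitening (TPW) only at a high level; a minimal sketch of two successive PCA-whitening passes might look like the following, where the intermediate dimensionality is an assumption of this sketch rather than a detail from the paper.

```python
import numpy as np

def pca_whiten(X, out_dim, eps=1e-8):
    """Center, project onto the top out_dim principal components, and scale
    each component to (approximately) unit variance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Per-component std dev of the projected data is S / sqrt(n - 1).
    W = Vt[:out_dim].T / (S[:out_dim] / np.sqrt(len(X) - 1) + eps)
    return Xc @ W

def twice_pca_whitening(X, out_dim):
    """TPW sketch: two successive PCA-whitening passes. The intermediate
    dimensionality (4 * out_dim, capped at the input width) is a guess."""
    Z = pca_whiten(X, min(X.shape[1], 4 * out_dim))
    return pca_whiten(Z, out_dim)
```

After each pass the retained components have near-unit variance, which is the property whitening is meant to enforce before short-vector retrieval.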

  • Conference Article
  • Cited by 9
  • 10.1109/mipr.2019.00025
Adversarial Learning for Content-Based Image Retrieval
  • Mar 1, 2019
  • Ling Huang + 4 more

In this paper, we propose a novel adversarial learning based framework, unsupervised adversarial image retrieval (UAIR) for content-based image retrieval. Different from most content-based image retrieval methods that use supervised learning in convolutional neural network to obtain semantic image features, we adopt adversarial training scheme to train the retrieval framework with unannotated information. A generative model and a discriminative model are designed for UAIR to learn together by pursuing competing goals. The generative model selects well-matched images and passes them to the discriminative model. The discriminative model judges the selected images as feedbacks to the generative model. Experimental results demonstrate the effectiveness of the proposed UAIR on two widely used databases. The performance of UAIR has been compared with other state-of-the-art image retrieval methods, including recently reported GAN-based methods. Experimental results show that the proposed UAIR achieves significant improvement in retrieval performance.

  • Research Article
  • Cited by 18
  • 10.1109/lsp.2019.2892233
Unsupervised Deep Hashing With Adaptive Feature Learning for Image Retrieval
  • Mar 1, 2019
  • IEEE Signal Processing Letters
  • Yuxuan Zhu + 2 more

The hashing method is widely used for large-scale image retrieval due to its low time and space complexity. However, the existing deep hashing methods are mainly designed for labeled datasets. Without supervised information, retrieval performance on unlabeled datasets is not guaranteed. In this letter, we propose a novel deep hashing approach for unsupervised image retrieval applications. The contributions are two-fold. First, the pseudolabels are generated using their global features aggregated from the pretrained network and employed as self-supervised information to optimize the objective function of training. Second, adaptive feature learning is used in this deep hashing framework to perform simultaneous hash function learning and global features learning in an unsupervised manner. The experimental results validated the effectiveness of the proposed method, obtaining state-of-the-art performances on several public datasets such as CIFAR-10, Holidays, and Oxford5k.
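The first contribution, pseudo-labels generated from clustered global features, can be illustrated with a plain k-means sketch; the actual feature aggregation from the pretrained network and the hash-function training are not reproduced here.

```python
import numpy as np

def kmeans_pseudo_labels(feats, k=10, iters=20):
    """Assign pseudo-labels by k-means clustering of global features, as a
    stand-in for the self-supervision step the letter describes. Uses a
    deterministic farthest-point initialization for reproducibility."""
    # feats: (n, d) float array of global image features
    centers = [feats[0].astype(float)]
    for _ in range(k - 1):
        d = np.min([((feats - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(feats[np.argmax(d)].astype(float))
    centers = np.array(centers)
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        labels = np.argmin(((feats[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    return labels
```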

  • Conference Article
  • 10.1109/icme46284.2020.9102905
Image Retrieval Based On Multi-Semantic Region Weighting And Multi-Scale Flatness Weighting
  • Jun 10, 2020
  • Zhuoyi Li + 3 more

The feature representation of images is the key to bridging the semantic gap and making computers understand images in image retrieval tasks. Semantic regions and salient targets are irreplaceable parts of cognitive images in image recognition. This paper proposes an unsupervised image retrieval method based on multi-semantic region weighting and multi-scale flatness weighting. Firstly, we divide the semantic regions by using the Fully Convolutional Network (FCN) and calculate the multi-semantic weight map (S-mask) to obtain the global features. Secondly, we introduce a flatness-weighted strategy to weight feature maps and aggregate the multi-scale features to obtain the local features. Finally, we cascade the global features and the local features to construct the final image representation. Experimental results on two widely-used databases demonstrate that the proposed method is effective and significantly outperforms the state-of-the-art retrieval methods.

  • Research Article
  • Cited by 4
  • 10.1007/s00521-018-3684-x
Bagging trees with Siamese-twin neural network hashing versus unhashed features for unsupervised image retrieval
  • Aug 18, 2018
  • Neural Computing and Applications
  • Mohamed Waleed Fakhr + 2 more

The goal of this paper is twofold. Firstly, a Siamese-twin random projection neural network (ST-RPNN) is proposed for unsupervised binary hashing of images and compared with state-of-the-art techniques. Secondly, a comparison between Hamming-distance-based retrieval and a proposed bagging trees retrieval (BT-retrieval) algorithm operating directly on the PCA features is made with respect to performance, storage and search time. The ST-RPNN is made of two identical random projection neural networks and is trained to produce similar binary codes for similar input image pairs and different binary codes otherwise. The learning process is divided into two steps: a fast sparse neurons selection algorithm followed by an unsupervised bagging trees algorithm to extract the compact required-length code. Moreover, a BT-retrieval algorithm is proposed in this paper as a fast retrieval tool that ranks the database with respect to a query without distance calculations. Furthermore, (BT-PCA) is a novel extension where the BT-retrieval is applied directly on the PCA features with a significantly lower time search than Hamming-distance-based approach. The proposed technique is compared with ten unsupervised image binary hashing techniques on the COREL1K dataset and the CIFAR10 dataset. The proposed technique obtained better precision–recall results than all compared techniques on the COREL1K dataset, and better than eight of them on the CIFAR10 dataset.
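The Hamming-distance baseline that BT-retrieval is compared against is straightforward to sketch: rank the database by the number of differing bits between each stored binary code and the query code.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database binary codes by Hamming distance to the query code.
    query_code: (b,) 0/1 array; db_codes: (n, b) 0/1 array.
    Returns database indices from most to least similar."""
    d = (db_codes != query_code).sum(axis=1)   # per-item Hamming distance
    return np.argsort(d, kind="stable")        # stable: ties keep db order
```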

  • Research Article
  • Cited by 3
  • 10.1007/s10994-024-06710-z
Intramodal consistency in triplet-based cross-modal learning for image retrieval
  • Feb 28, 2025
  • Machine Learning
  • Mario Mallea + 2 more

Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique to address this problem. This paper shows that such an approach is not always effective in handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To the best of our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels. We present comprehensive experiments on MS-COCO and Flickr30k, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. We also conduct a case study on the ROCO dataset to assess the performance of our method on medical images and present an ablation study on one of our approaches to understand the impact of the different components of the proposed loss function. Our code is publicly available on GitHub https://github.com/MariodotR/FullHN.git.
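The inconsistency described above can be illustrated with a toy loss: a standard cross-modal triplet term plus an extra hinge that penalizes ranking an intra-modal pair above the cross-modal ground-truth pair. The exact form of the paper's two proposed losses may differ; this formulation and the weight `beta` are illustrative assumptions.

```python
import numpy as np

def triplet_with_intramodal(img_a, txt_p, txt_n, img_n, margin=0.2, beta=0.2):
    """Cross-modal triplet loss plus an intra-modal consistency term.
    img_a: anchor image; txt_p: matching text; txt_n: non-matching text;
    img_n: another image with unknown ground-truth similarity. All inputs
    are 1-D L2-normalized embeddings; similarity is the dot product."""
    s_pos = float(img_a @ txt_p)                              # ground-truth pair
    cross = max(0.0, margin + float(img_a @ txt_n) - s_pos)   # standard triplet
    # Penalize the intra-modal pair (img_a, img_n) outranking (img_a, txt_p).
    intra = max(0.0, margin + float(img_a @ img_n) - s_pos)
    return cross + beta * intra
```

With orthogonal negatives both terms vanish; an intra-modal negative nearly identical to the anchor activates only the consistency term.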

  • Research Article
  • Cited by 14
  • 10.1016/j.patrec.2020.03.032
Graph-based selective rank fusion for unsupervised image retrieval
  • Apr 3, 2020
  • Pattern Recognition Letters
  • Lucas Pascotti Valem + 1 more
