Abstract

While transformers have improved image retrieval accuracy in computer vision, challenges persist, including insufficient and imbalanced feature extraction and the inability to produce compact binary codes. This study introduces a novel approach for image retrieval called Vision Transformer with Deep Hashing (VTDH), which combines a hybrid neural network with optimized metric learning. Our main contributions are as follows. We introduce a Strengthened External Attention (NEA) module that simultaneously focuses on multi-scale features and assimilates global context, enriching the model's understanding of both overall structure and semantics. We also propose a new balanced loss function to address the imbalance between positive and negative samples within labels: it takes sample labels as input and uses the mean value of all sample labels to quantify the frequency gap between positive and negative samples, which, combined with a customized balance weight, effectively mitigates label imbalance. In addition, we strengthen the quantization loss function by intensifying its penalty when the model's binary code output exceeds ±1, yielding more robust and stable hash codes. The proposed method is evaluated on the CIFAR-10, NUS-WIDE, and ImageNet datasets. Experimental results show higher retrieval accuracy than current state-of-the-art techniques, with the VTDH model achieving a mean average precision (mAP) of 97.3% on CIFAR-10.
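
The abstract only sketches the two loss terms, so the following minimal PyTorch illustration shows one plausible reading of them: a label-mean-based balance weight and a quantization penalty that grows sharply once the relaxed code leaves the ±1 range. The function names, the cubic exponent, and the exact weighting scheme are assumptions made for illustration and are not taken from the paper.

```python
import torch


def balance_weights(labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-entry balance weight. Following the abstract, the mean of all
    sample labels estimates the positive/negative frequency gap; the rarer class
    receives the larger weight."""
    pos_freq = labels.float().mean()              # fraction of positive label entries
    w_pos = 1.0 - pos_freq                        # up-weight the (usually rarer) positives
    w_neg = pos_freq                              # down-weight the abundant negatives
    return torch.where(labels > 0, w_pos, w_neg)  # broadcast weights over the label matrix


def quantization_loss(codes: torch.Tensor, power: int = 3) -> torch.Tensor:
    """Hypothetical strengthened quantization penalty: the term grows rapidly once
    |code| exceeds 1, pushing relaxed outputs back toward the ±1 hash vertices."""
    excess = torch.clamp(codes.abs() - 1.0, min=0.0)
    return (excess ** power).mean()


if __name__ == "__main__":
    labels = torch.randint(0, 2, (8, 21))     # toy multi-label matrix: 8 samples, 21 classes
    codes = torch.randn(8, 48) * 1.5          # toy relaxed 48-bit hash outputs
    print(balance_weights(labels).shape)      # torch.Size([8, 21])
    print(quantization_loss(codes))           # scalar penalty
```

In practice these terms would be combined with a similarity-preserving hashing loss; the abstract does not specify that combination, so the sketch shows the two auxiliary terms in isolation.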
