Abstract

Fine-grained image recognition aims at the precise sub-category classification of images, and therefore requires algorithms with a strong ability to extract subtle features. Recently, the Transformer architecture has been successfully applied to vision tasks, offering a novel way to improve the feature extraction performance of fine-grained recognition algorithms. However, fine-grained image datasets are usually limited in size, which is unfavorable for the data-hungry training process of Transformers. To increase the amount of data available for training, in this paper we first introduce a stochastic image data augmentation method for the Vision Transformer (ViT), which uses a Dense-DETR model to extract feature regions and performs random insertion and removal on the transformed patch sequence. To select the most informative sequence elements during forward propagation, we implement a feature patch selection strategy by attaching an additional convolutional network structure to the ViT encoders. Inspired by active learning, a contrastive loss utilizing the posterior information of paired images is also introduced as a penalty term in ViT's cross-entropy objective. These strategies enable the ViT to extract the most discriminative feature information from its input. Extensive experiments show that the proposed sequence-selective Vision Transformer achieves the highest recognition accuracy on several widely used fine-grained image datasets.
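
To make the augmentation idea concrete, a minimal PyTorch sketch of random insertion and removal on a patch sequence is given below. This is an illustration only: the paper inserts patches derived from Dense-DETR feature regions, whereas here duplicated existing patches stand in for them, and the function name and probability values are assumptions not taken from the paper.

    import torch

    def augment_patch_sequence(patches: torch.Tensor,
                               drop_prob: float = 0.1,
                               insert_prob: float = 0.1) -> torch.Tensor:
        # Stochastically drop and insert elements of a ViT patch sequence.
        # patches: (N, D) tensor of patch embeddings for one image.
        # Hypothetical sketch: the paper draws inserted patches from
        # Dense-DETR feature regions; duplicates stand in for them here.
        n = patches.size(0)

        # Removal: keep each patch independently with probability
        # 1 - drop_prob, retaining at least one patch so the sequence
        # never becomes empty.
        keep = torch.rand(n) > drop_prob
        if not keep.any():
            keep[torch.randint(n, (1,))] = True
        seq = patches[keep]

        # Insertion: splice a random number of stand-in feature patches
        # into the sequence at random positions.
        n_insert = int((torch.rand(n) < insert_prob).sum())
        for src in torch.randint(seq.size(0), (n_insert,)):
            pos = int(torch.randint(seq.size(0) + 1, (1,)))
            seq = torch.cat([seq[:pos], seq[src:src + 1], seq[pos:]], dim=0)
        return seq

Since insertion and removal change the sequence length, per-image processing or padding would be needed before batching; the abstract leaves such details to the full text.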
