Mix-ViT: Mixing attentive vision transformer for ultra-fine-grained visual categorization

Xiaohan Yu, Jun Wang, Yang Zhao, Yongsheng Gao

doi:10.1016/j.patcog.2022.109131

Abstract

Ultra-fine-grained visual categorization (ultra-FGVC) moves down the taxonomy level to classify sub-granularity categories of fine-grained objects. This inevitably poses a challenge, i.e., classifying highly similar objects with limited samples, which impedes the performance of recent advanced vision transformer methods. To that end, this paper introduces Mix-ViT, a novel mixing attentive vision transformer that addresses this challenge to improve ultra-FGVC. The core design is a self-supervised module that mixes the high-level sample tokens, attentively substitutes tokens, and then learns to predict whether each token has been substituted. This drives the model to understand the contextual discriminative details among inter-class samples. By incorporating such a self-supervised module, the network gains more knowledge from the intrinsic structure of the input data and thus improves its generalization capability with limited training samples. The proposed Mix-ViT achieves competitive performance on seven publicly available datasets, demonstrating for the first time the potential of vision transformers compared to CNNs in addressing the challenging ultra-FGVC tasks. The code is available at https://github.com/Markin-Wang/MixViT
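
The token-substitution idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not taken from the authors' repository and makes several assumptions the abstract does not state: tokens are swapped position-for-position between two samples, "attentively" is read as "the tokens with the highest attention score are the ones replaced", and the detection objective is a per-token binary cross-entropy. The names mix_tokens, substitution_loss, attn_a, and num_swap are hypothetical.

```python
import math
import random


def mix_tokens(tokens_a, tokens_b, attn_a, num_swap):
    """Replace the num_swap most-attended tokens of sample A with B's tokens.

    Assumption (not from the abstract): the swap is position-for-position and
    the tokens to replace are chosen by ranking one attention score per token.
    Returns the mixed token list and a 0/1 label per position (1 = substituted).
    """
    if not (len(tokens_a) == len(tokens_b) == len(attn_a)):
        raise ValueError("tokens_a, tokens_b and attn_a must have equal length")
    # Positions sorted by attention score, highest first.
    ranked = sorted(range(len(attn_a)), key=lambda i: attn_a[i], reverse=True)
    swapped = set(ranked[:num_swap])
    mixed = [tokens_b[i] if i in swapped else tokens_a[i]
             for i in range(len(tokens_a))]
    labels = [1 if i in swapped else 0 for i in range(len(tokens_a))]
    return mixed, labels


def substitution_loss(logits, labels):
    """Mean binary cross-entropy between per-token logits and 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of BCE with logits:
        # max(z, 0) - y * z + log(1 + exp(-|z|))
        total += max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
    return total / len(labels)


if __name__ == "__main__":
    random.seed(0)
    num_tokens, dim = 6, 4
    # Stand-ins for high-level token embeddings of two different samples.
    tokens_a = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    tokens_b = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    # Hypothetical per-token attention scores for sample A.
    attn_a = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]

    mixed, labels = mix_tokens(tokens_a, tokens_b, attn_a, num_swap=2)
    # A real model would produce these logits from the mixed tokens;
    # zeros stand in for an untrained detection head.
    logits = [0.0] * num_tokens
    print(labels, substitution_loss(logits, labels))
```

In a full model, the logits would come from a small head applied to the transformer's output for the mixed tokens, and this loss would presumably be added to the usual classification loss; how the two are weighted is not stated in the abstract.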

Journal: Pattern Recognition
Publication Date: Oct 28, 2022
Citations: 34

