Abstract

Vision Transformers (ViTs) excel at large-scale image recognition but struggle when training data is limited, largely because they make poor use of patch-level local information. Existing methods enhance local representations at the model level but tend to treat all features equally, so irrelevant information introduces noise. Distinguishing discriminative features from irrelevant ones helps minimize this interference at the model level. To address this, we introduce the Dual-objective Affine Vision Transformer (DoA-ViT), which improves feature discrimination to strengthen ViTs on data-limited tasks. DoA-ViT incorporates a learnable affine transformation that associates transformed features with class-specific ones while preserving their intrinsic characteristics. In addition, an adaptive patch-based enhancement mechanism assigns importance scores to patches, minimizing the impact of irrelevant information. Both components can be integrated into existing ViTs as plug-and-play modules. Extensive experiments on small-scale datasets show that DoA-ViT consistently outperforms existing methods, and visualization results highlight its ability to identify critical image regions.
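
To make the two components concrete, the sketch below shows one way such a module could look in PyTorch. It is a minimal illustration, assuming a per-channel learnable affine transform on patch tokens and a sigmoid-gated importance score per patch combined through a residual connection; the class name, scoring head, and combination rule are hypothetical, since the abstract does not specify the exact formulation used in DoA-ViT.

```python
import torch
import torch.nn as nn

class AffinePatchEnhancer(nn.Module):
    """Illustrative sketch of a learnable affine transform on patch tokens
    plus adaptive per-patch importance weighting. Names and details are
    assumptions, not the paper's exact design."""

    def __init__(self, dim: int):
        super().__init__()
        # Learnable affine transformation (per-channel scale and shift).
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
        # Small head that assigns an importance score to each patch token.
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim)
        transformed = tokens * self.scale + self.shift
        # Importance in (0, 1); down-weights patches judged irrelevant.
        weights = torch.sigmoid(self.score(tokens))   # (batch, num_patches, 1)
        # Residual keeps the intrinsic features while adding the weighted,
        # affine-transformed representation.
        return tokens + weights * transformed


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)        # e.g. 14x14 patches, embedding dim 384
    enhancer = AffinePatchEnhancer(384)
    print(enhancer(x).shape)            # torch.Size([2, 196, 384])
```

Because the module preserves input shape, it could in principle be dropped after any transformer block of an existing ViT, consistent with the plug-and-play claim above.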
