Abstract
Vision Transformers (ViTs) excel in large-scale image recognition tasks but struggle when data is limited because they make ineffective use of patch-level local information. Existing methods enhance local representations at the model level but often treat all features equally, so irrelevant information introduces noise. Effectively distinguishing discriminative features from irrelevant information helps minimize this noise at the model level. To tackle this, we introduce the Dual-objective Affine Vision Transformer (DoA-ViT), which enhances ViTs for data-limited tasks by improving feature discrimination. DoA-ViT incorporates a learnable affine transformation that associates transformed features with class-specific ones while preserving their intrinsic characteristics. In addition, an adaptive patch-based enhancement mechanism assigns importance scores to patches, minimizing the impact of irrelevant information. These enhancements can be seamlessly integrated into existing ViTs as plug-and-play components. Extensive experiments on small-scale datasets show that DoA-ViT consistently outperforms existing methods, with visualization results highlighting its ability to identify critical image regions effectively.
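To make the two mechanisms concrete, the following is a minimal sketch of how a plug-and-play module combining a learnable affine transformation with per-patch importance scoring might be attached to a ViT block. It is not the authors' released implementation; the module name `AffinePatchEnhancer`, the sigmoid-based scorer, and the residual combination are illustrative assumptions based only on the abstract's high-level description.

```python
# Hypothetical sketch (not the paper's official code): a plug-in module that
# applies a learnable affine transformation to patch tokens and weights each
# patch by a learned importance score, mirroring the abstract's description.
import torch
import torch.nn as nn


class AffinePatchEnhancer(nn.Module):
    """Illustrative plug-and-play block for ViT patch tokens (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        # Learnable per-channel affine transformation (scale and shift).
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
        # Small scorer that assigns an importance weight to every patch token.
        self.scorer = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) patch embeddings from a ViT block.
        affine = tokens * self.scale + self.shift        # transformed features
        weights = torch.sigmoid(self.scorer(tokens))     # per-patch score in (0, 1)
        # Residual combination preserves the intrinsic features while
        # down-weighting patches judged irrelevant.
        return tokens + weights * affine


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)          # e.g., 14x14 patches, embed dim 384
    module = AffinePatchEnhancer(dim=384)
    print(module(x).shape)                # torch.Size([2, 196, 384])
```

Because the module maps tokens of shape `(batch, num_patches, dim)` back to the same shape, it can be inserted after any transformer block without altering the surrounding architecture, which is one plausible reading of the abstract's "plug-and-play" claim.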