Abstract
Robust and adaptive representations are a cornerstone for vision-language pretraining models to handle the unpredictability of real-world scenarios. This paper examines two pivotal misalignment problems inherent to Contrastive Language-Image Pre-training (CLIP) models: attention misalignment, which leads to an overemphasis on background elements rather than salient objects, and predictive category misalignment, in which the model fails to distinguish between semantically similar classes. Both misalignments undermine the representational stability that dynamic, real-world applications require. To address them, we propose AlignCLIP, a fine-tuning method built around an attention alignment loss that calibrates the distribution of attention across multi-head attention layers. AlignCLIP further introduces semantic label smoothing, a technique that leverages textual class similarities to refine the prediction hierarchy. Through comprehensive experiments on a variety of datasets, including scenarios with distribution shifts and unseen classes, we demonstrate that AlignCLIP significantly enhances representational stability and exhibits superior generalization.
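As a rough illustration of the semantic label smoothing idea described above, the sketch below spreads a small amount of target probability mass over classes in proportion to the cosine similarity of their CLIP text embeddings. The mixing rule, the weight `alpha`, and the temperature `tau` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of similarity-based label smoothing (assumed formulation).
import torch
import torch.nn.functional as F


def semantic_soft_targets(text_embeds: torch.Tensor,
                          labels: torch.Tensor,
                          alpha: float = 0.1,
                          tau: float = 0.07) -> torch.Tensor:
    """Build soft targets that spread `alpha` probability mass over the other
    classes in proportion to their text-embedding similarity to the true class.

    text_embeds: (C, D) L2-normalized class text embeddings (e.g., CLIP text encoder)
    labels:      (B,)   integer ground-truth class indices
    """
    # Pairwise cosine similarity between class prompts, shape (C, C).
    sim = text_embeds @ text_embeds.t()
    # Exclude the diagonal so mass is redistributed only to *other* classes.
    sim.fill_diagonal_(float("-inf"))
    neighbor_dist = F.softmax(sim / tau, dim=-1)  # (C, C)

    # One-hot targets for the true classes, shape (B, C).
    hard = F.one_hot(labels, num_classes=text_embeds.size(0)).float()
    # Keep (1 - alpha) on the true class; distribute alpha by similarity.
    return (1.0 - alpha) * hard + alpha * neighbor_dist[labels]


def smoothed_cross_entropy(logits: torch.Tensor,
                           soft_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of image-text logits against the smoothed targets."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

In this sketch, semantically close classes (by text-embedding similarity) receive more of the smoothed probability mass than unrelated ones, which is one plausible way to encode the prediction hierarchy the abstract refers to.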