Unsupervised Domain Adaptation (UDA) aims to transfer models trained on a labeled source domain to an unlabeled target domain. Owing to the strong generalization ability of Vision-Language Models (VLMs) such as CLIP on downstream tasks, most recent methods apply CLIP to UDA by learning domain-specific text prompts for the source and target domains separately. However, these methods fail to dynamically adjust image features according to the characteristics of their respective domains, which limits their alignment with the domain-specific text prompts in CLIP’s joint space, a key factor in improving classification performance on the target domain. To bridge this gap, we propose a Unified Text-Image Space Alignment with Cross-Modal Prompting (UTISA) framework for UDA. First, we introduce a cross-modal prompt learning (CMP) module that generates domain-specific and layer-specific image prompts for the visual branch, encoding domain-specific knowledge both globally and locally. Second, guided by these image prompts, we introduce a domain-aware multi-layer feature fusion (DMF) module that constructs multi-layer domain features for each domain and uses them to enhance the image features, enabling the image features to better reflect the characteristics of their respective domains and thereby promoting alignment with the domain-specific text prompts. Moreover, we introduce a perturbation-driven regularization (PDR) mechanism for the target domain to improve the robustness and generalization of the model. Experiments demonstrate that UTISA achieves the best performance on three mainstream UDA benchmarks, reaching 87.9% on OfficeHome, 90.9% on VisDA-2017, and 62.4% on DomainNet.
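As a rough illustration of what "layer-specific image prompts" in the visual branch could look like, the minimal PyTorch sketch below prepends a set of learnable prompt tokens to the input of a transformer block and discards them afterwards, in the spirit of deep visual prompt tuning. All names here (LayerPromptedBlock, num_prompts, embed_dim) are illustrative assumptions and do not reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class LayerPromptedBlock(nn.Module):
    """Wrap a transformer block with its own learnable prompt tokens.

    Illustrative sketch of layer-specific visual prompting, not the
    paper's code: each wrapped block owns a small set of prompt tokens
    that are prepended to the token sequence before the block runs.
    """

    def __init__(self, block: nn.Module, num_prompts: int = 4, embed_dim: int = 768):
        super().__init__()
        self.block = block
        # One set of prompt tokens per layer, small random init
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq_len, embed_dim] (e.g., CLS + patch tokens)
        batch = tokens.shape[0]
        prompts = self.prompts.expand(batch, -1, -1)
        # Prepend this layer's prompts, run the block, then drop the
        # prompt positions so later layers see the original sequence length.
        out = self.block(torch.cat([prompts, tokens], dim=1))
        return out[:, self.prompts.shape[1]:, :]

# Minimal usage with a stand-in transformer block (assumed shapes)
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
prompted = LayerPromptedBlock(block, num_prompts=4, embed_dim=768)
tokens = torch.randn(2, 197, 768)   # [batch, CLS + 196 patch tokens, dim]
out = prompted(tokens)              # same shape as the input tokens
```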