Unsupervised Domain Adaptive Semantic Segmentation Based on CLIP-Guided Prototypical Contrastive Learning
Domain adaptive semantic segmentation aims to improve model performance by bridging the gap between the source and target domains. Recent works show that prototypical contrastive learning is a powerful approach to this problem. However, prototypes can become unstable when visual characteristics (e.g., color, scale, and shape) vary significantly across images. Additionally, prototypes generated from the source domain are highly correlated with domain-specific information, which limits further gains in domain alignment. To address these issues, we propose a new method based on CLIP-guided Prototypical Contrastive Learning (CLIP-ProCL). Our approach jointly exploits the rich text and image knowledge of CLIP to perform domain alignment. For the former, we obtain robust, domain-agnostic prototypes from text prompts. For the latter, we leverage the image priors of CLIP to pull the features learned by the segmentation network closer to the CLIP embedding space. Experiments on the benchmark tasks GTA5 $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes demonstrate that our approach outperforms state-of-the-art methods. Our code is available at https://github.com/bupt-ai-cz/CLIP-ProCL.
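To make the core idea concrete, the sketch below shows one plausible way (not the authors' released implementation) to derive class prototypes from CLIP text prompts and use them in a pixel-wise prototypical contrastive loss. The class-name subset, prompt template, projection dimension, and temperature are illustrative assumptions.

```python
# Minimal sketch: CLIP text embeddings as domain-agnostic class prototypes
# for a prototypical contrastive loss. NOT the paper's official code; all
# hyperparameters and names below are illustrative assumptions.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

# Illustrative subset of Cityscapes class names (assumption).
CLASS_NAMES = ["road", "sidewalk", "building", "car", "person"]

@torch.no_grad()
def build_text_prototypes(device="cuda"):
    """Encode one prompt per class with CLIP's frozen text encoder and
    L2-normalize the embeddings so they can serve as class prototypes."""
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([f"a photo of a {c}" for c in CLASS_NAMES]).to(device)
    protos = model.encode_text(tokens).float()        # (C, D), D = 512 for ViT-B/32
    return F.normalize(protos, dim=-1)

def prototypical_contrastive_loss(pixel_feats, labels, prototypes, tau=0.07):
    """Pull each pixel feature toward the prototype of its (pseudo-)label
    and push it away from the other class prototypes.

    pixel_feats: (N, D) projected pixel features (N valid pixels)
    labels:      (N,)   class indices in [0, C)
    prototypes:  (C, D) normalized CLIP text prototypes
    """
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    logits = pixel_feats @ prototypes.t() / tau       # (N, C) scaled cosine similarity
    return F.cross_entropy(logits, labels)
```

In this reading, the segmentation network would need a projection head mapping pixel features to CLIP's embedding dimension (512 for ViT-B/32), and target-domain pixels would be supervised with pseudo-labels; because the prototypes come from text rather than source images, they are fixed and carry no source-domain appearance bias.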