Abstract

Zero-shot semantic segmentation (ZS3), the process of classifying unseen classes without explicit training samples, poses a significant challenge. Despite notable progress made by pre-trained vision-language models, they have a problem of “supervision leakage” in the unseen classes due to their large-scale pre-trained data. For example, CLIP is trained on 400M image–text pairs that contain large label space categories. So, it is not convincing for real “zero-shot” learning in machine learning. This paper introduces SwinZS3, an innovative framework that explores the “no-supervision-leakage” zero-shot semantic segmentation with an image encoder that is not pre-trained on the seen classes. SwinZS3 integrates the strengths of both visual and semantic embeddings within a unified joint embedding space. This approach unifies a transformer-based image encoder with a language encoder. A distinguishing feature of SwinZS3 is the implementation of four specialized loss functions in the training progress: cross-entropy loss, semantic-consistency loss, regression loss, and pixel-text score loss. These functions guide the optimization process based on dense semantic prototypes derived from the language encoder, making the encoder adept at recognizing unseen classes during inference without retraining. We evaluated SwinZS3 with standard ZS3 benchmarks, including PASCAL VOC and PASCAL Context. The outcomes affirm the effectiveness of our method, marking a new milestone in “no-supervison-leakage” ZS3 task performance.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call