Abstract

As a pixel-level prediction task in computer vision, semantic segmentation depends on abundant and relatively fine-grained image annotation for strong performance. This paper proposes a flexible two-stage semantic segmentation framework for domain adaptation from virtual scenes to real street scenes. In the first stage, supervised pixel-level contrastive learning is performed on virtual scenes and their semantic labels, enabling the model to learn a highly structured pixel-level feature space. In the second stage, the pixel discrimination model obtained in the first stage predicts semantics for real scenes to adapt to the target domain: the predictions on real scenes serve as pseudo-labels for unsupervised segmentation of single frames in an iterative training scheme. In particular, exploiting the sequential nature of street-view imagery, the method alternately trains on and predicts consecutive frames, so the model is fine-tuned continually as the environment changes. Within this framework, semantic segmentation labels that are easily obtained in virtual scenes drive scene adaptation; more importantly, knowledge mining in the source domain and adaptive adjustment in the target domain remain flexibly independent. We apply the method to environment perception for autonomous driving, performing dynamic iterative prediction on consecutive frames of sequential street scenes, which enhances adaptability to various target domains. Finally, we achieve more competitive results in the GTA->Cityscapes and GTA->ApolloScape domain adaptation experiments.
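To make the first stage concrete, below is a minimal PyTorch-style sketch of a supervised pixel-level contrastive loss over virtual-scene labels. The sampling size, temperature, and normalization choices here are illustrative assumptions, not the paper's exact recipe; it assumes per-pixel embeddings have already been extracted and flattened.

```python
import torch
import torch.nn.functional as F

def pixel_supcon_loss(embeddings, labels, num_samples=512, temperature=0.1):
    """Supervised contrastive loss over sampled pixels.

    embeddings: (N, D) per-pixel features (e.g. a flattened encoder output).
    labels:     (N,)   semantic class id of each pixel from the virtual scene.
    """
    # Subsample pixels so the pairwise similarity matrix stays tractable.
    idx = torch.randperm(embeddings.size(0))[:num_samples]
    feats = F.normalize(embeddings[idx], dim=1)                # unit-length features
    lbls = labels[idx]

    sim = feats @ feats.t() / temperature                      # pairwise similarities
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()   # numerical stability

    eye = torch.eye(len(lbls), device=sim.device)
    pos = (lbls.unsqueeze(0) == lbls.unsqueeze(1)).float() * (1 - eye)

    # log p(j | i) over all non-self pairs.
    exp_sim = torch.exp(sim) * (1 - eye)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)

    # Average log-probability of same-class (positive) pairs per anchor;
    # anchors with no positives in the sample are skipped.
    mean_pos = (pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    valid = pos.sum(dim=1) > 0
    return -mean_pos[valid].mean()
```

In practice a (B, D, H, W) feature map would be flattened to (B·H·W, D) and the label map downsampled to the feature resolution before calling this loss, which pulls same-class pixels together and pushes different-class pixels apart to produce the structured feature space described above.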
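The second stage can likewise be summarized as a pseudo-label self-training loop over the frame sequence. This sketch assumes a segmentation model carried over from stage one, a fixed confidence threshold for accepting pseudo-labels, and a small number of fine-tuning steps per frame; these are illustrative choices rather than the paper's exact schedule.

```python
import torch
import torch.nn.functional as F

def adapt_on_sequence(model, optimizer, frames, conf_thresh=0.9, steps_per_frame=1):
    """Iteratively predict and fine-tune on consecutive street-view frames.

    frames: iterable of (1, 3, H, W) tensors from a street-view sequence.
    Yields the pseudo-label map used for each frame.
    """
    for frame in frames:
        # 1) Predict the current frame and keep only confident pixels
        #    as pseudo-labels (255 is the ignore index).
        model.eval()
        with torch.no_grad():
            probs = F.softmax(model(frame), dim=1)   # (1, C, H, W)
            conf, pseudo = probs.max(dim=1)          # per-pixel confidence / class
            pseudo[conf < conf_thresh] = 255

        # 2) Briefly fine-tune on the pseudo-labels before the next frame,
        #    so the model tracks the changing environment.
        model.train()
        for _ in range(steps_per_frame):
            loss = F.cross_entropy(model(frame), pseudo, ignore_index=255)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        yield pseudo
```

Because the loop only consumes the model and an optimizer, the source-domain training of stage one and this target-domain adaptation remain independent, mirroring the decoupling the abstract emphasizes.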
