Abstract

Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors, thus eliminating the need for manual labeling. Our contributions are as follows. First, we develop a novel approach for cross-modal unsupervised learning of semantic segmentation by leveraging synchronized LiDAR and image data. A crucial element of our method is the integration of an object proposal module that examines the LiDAR point cloud to generate proposals for spatially consistent objects. Second, we demonstrate that these 3D object proposals can be aligned with the corresponding images and effectively grouped into semantically meaningful pseudo-classes. Third, we introduce a cross-modal distillation technique that uses image data partially annotated with the learnt pseudo-classes to train a transformer-based model for semantic image segmentation. Fourth, we demonstrate further significant improvements by extending the proposed model with teacher-student distillation, using an exponential moving average teacher and soft targets from the teacher. We show the generalization capabilities of our method by evaluating on four different datasets (Cityscapes, Dark Zurich, Nighttime Driving, and ACDC) without any fine-tuning. We present an in-depth experimental analysis of the proposed model, including results with another pre-training dataset, per-class and pixel accuracy results, confusion matrices, PCA visualizations, k-NN evaluation, ablations on the number of clusters and LiDAR density, and supervised fine-tuning, as well as additional qualitative results and their analysis.
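To make the fourth contribution concrete, the following is a minimal sketch (not the authors' code) of teacher-student distillation with an exponential moving average teacher and soft targets, assuming a generic PyTorch segmentation model; the function names, momentum value, and temperature are illustrative assumptions.

```python
# Hypothetical sketch of EMA teacher-student distillation with soft targets.
import copy
import torch
import torch.nn.functional as F

def update_teacher(student, teacher, momentum=0.999):
    """Exponential moving average of the student weights into the teacher."""
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between student predictions and the teacher's soft targets,
    computed per pixel over the class dimension of (B, C, H, W) logits."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Usage (assumed workflow): the teacher starts as a frozen copy of the student
# and is updated by EMA after each optimizer step.
# student = MySegmentationModel()              # hypothetical model class
# teacher = copy.deepcopy(student)
# for p in teacher.parameters():
#     p.requires_grad_(False)
```

This reflects the standard mean-teacher recipe; the paper's exact loss weighting and update schedule may differ.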