Scene appearance changes drastically throughout the day. Existing semantic segmentation methods mainly focus on well-lit daytime scenarios and are not designed to cope with such large appearance changes. Naively applying domain adaptation does not solve this problem, because it usually learns a fixed mapping between the source and target domains and thus generalizes poorly across all-day scenarios (i.e., from dawn to night). In this paper, in contrast to existing methods, we tackle this challenge from the perspective of image formation itself, where image appearance is determined by both intrinsic (e.g., semantic category, structure) and extrinsic (e.g., lighting) properties. To this end, we propose a novel intrinsic-extrinsic interactive learning strategy. The key idea is to let the intrinsic and extrinsic representations interact during learning under spatial-wise guidance. In this way, the intrinsic representation becomes more stable and, at the same time, the extrinsic representation better depicts the changes. Consequently, the refined image representation is more robust for generating pixel-wise predictions in all-day scenarios. To achieve this, we propose an end-to-end All-in-One Segmentation Network (AO-SegNet). Large-scale experiments are conducted on three real datasets (Mapillary, BDD100K and ACDC) and our proposed synthetic All-day CityScapes dataset. The proposed AO-SegNet shows a significant performance gain over the state of the art under a variety of CNN and ViT backbones on all the datasets.
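To make the intrinsic-extrinsic interaction idea concrete, the sketch below shows one plausible way such a block could be wired up in PyTorch: two feature branches re-weight each other with per-pixel spatial gates before fusion. This is a minimal illustration written for this summary only; the module name, channel sizes, and gating scheme are assumptions and do not describe the actual AO-SegNet implementation.

```python
# Illustrative sketch only (not the authors' code): a minimal
# intrinsic-extrinsic interaction block with spatial-wise gating.
import torch
import torch.nn as nn


class IntrinsicExtrinsicInteraction(nn.Module):
    """Exchanges information between an intrinsic (semantics/structure)
    branch and an extrinsic (lighting) branch via per-pixel gate maps."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # 1x1 convs predict a spatial gate from the opposite branch.
        self.gate_from_extrinsic = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.gate_from_intrinsic = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_int: torch.Tensor, f_ext: torch.Tensor) -> torch.Tensor:
        # Spatial-wise guidance: each branch re-weights the other per pixel.
        f_int_refined = f_int * self.gate_from_extrinsic(f_ext)
        f_ext_refined = f_ext * self.gate_from_intrinsic(f_int)
        # Fused representation that a segmentation head could consume.
        return self.fuse(torch.cat([f_int_refined, f_ext_refined], dim=1))


if __name__ == "__main__":
    block = IntrinsicExtrinsicInteraction(channels=256)
    f_int = torch.randn(1, 256, 64, 128)  # intrinsic features (B, C, H, W)
    f_ext = torch.randn(1, 256, 64, 128)  # extrinsic features (B, C, H, W)
    print(block(f_int, f_ext).shape)      # -> torch.Size([1, 256, 64, 128])
```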