Semantic image segmentation requires large amounts of pixel-wise labeled training data, and creating such data generally demands labor-intensive manual annotation. Extracting training data from video games is therefore a practical alternative, because pixel-wise annotations can be generated automatically from the game engine with near-perfect accuracy. However, experiments show that models trained on raw video-game data cannot be directly applied to real-world scenes because of domain shift. In this paper, we propose a domain-adaptive network based on CycleGAN that translates scenes from a virtual domain to a real domain in both the pixel and feature spaces. Our contributions are threefold: 1) we propose a dynamic perceptual network that improves the quality of the generated images in the feature space, making the translated images more conducive to semantic segmentation; 2) we introduce a novel weighted self-regularization loss that prevents semantic changes in translated images; and 3) we design a discrimination mechanism that coordinates the multiple subnetworks and improves overall training efficiency. In experiments on the public GTA-V dataset (a video game dataset, i.e., the virtual domain) and Cityscapes (a real-world dataset, i.e., the real domain), we devise a series of metrics to evaluate the quality of the translated images and achieve notably improved results, demonstrating the efficacy of the proposed model.
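To make the feature-space and self-regularization ideas concrete, the following is a minimal sketch, not the paper's implementation: the abstract does not specify the dynamic perceptual network or the weighting scheme, so the fixed VGG16 feature extractor, the `PerceptualLoss` module, the `weighted_self_regularization` function, and the per-pixel `weight_map` are all illustrative assumptions layered on top of standard CycleGAN-style losses.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hypothetical sketch of two auxiliary losses often paired with CycleGAN-style
# translation: a perceptual loss computed in a pretrained feature space and a
# weighted self-regularization term that discourages per-pixel semantic drift.
# The paper's actual "dynamic perceptual network" and weight map are unknown;
# VGG16 features and a user-supplied weight map are stand-ins.

class PerceptualLoss(torch.nn.Module):
    """Compare source and translated images in a frozen VGG16 feature space."""
    def __init__(self, layer_index: int = 16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features
        self.extractor = vgg[:layer_index].eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)

    def forward(self, source: torch.Tensor, translated: torch.Tensor) -> torch.Tensor:
        # L1 distance between feature maps of the source and translated images.
        return F.l1_loss(self.extractor(translated), self.extractor(source))


def weighted_self_regularization(source: torch.Tensor,
                                 translated: torch.Tensor,
                                 weight_map: torch.Tensor) -> torch.Tensor:
    """Penalize per-pixel change between source and translated images,
    scaled by a (N, 1, H, W) weight map emphasizing critical regions."""
    per_pixel = (translated - source).abs().mean(dim=1, keepdim=True)
    return (weight_map * per_pixel).mean()


if __name__ == "__main__":
    # Random tensors stand in for a GTA-V batch and the generator's output.
    src = torch.rand(2, 3, 256, 256)    # virtual-domain images
    fake = torch.rand(2, 3, 256, 256)   # translated (real-domain style) images
    w = torch.ones(2, 1, 256, 256)      # uniform weights as a placeholder
    loss = PerceptualLoss()(src, fake) + weighted_self_regularization(src, fake, w)
    print(loss.item())
```

In practice such terms would be added to the usual adversarial and cycle-consistency losses of the generator objective; the relative weighting is a hyperparameter not given in the abstract.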