Cross-view image-to-image translation refers to generating what a scene looks like from a different viewpoint at the same location, which involves simultaneous semantic and appearance translation. Previous works mainly focused on semantic translation, using a semantic map as additional guidance so that the network learns a good semantic mapping across views. However, appearance translation between two views remains ambiguous due to the large differences in distance and viewing angle, let alone multi-modal translation. In this paper, we propose a novel end-to-end network called the Cascaded Residual-based Progressive-refinement GAN (CRP-GAN). Specifically, an aerial image and a semantic map are used to progressively generate multi-modal refined panoramas. The CRP-GAN has three novelties. First, it generates ground-level panoramas with a wide field of view, rather than images with a limited field of view, by fully exploiting the spatial information of the aerial image. Second, the proposed model generates multi-modal cross-view images, in contrast to previous one-to-one image translation. Third, a novel cascaded image refinement strategy synthesizes images with more detail and less blur at each refinement stage. We conducted extensive experiments on the CVUSA and Dayton datasets for cross-view image-to-image translation.
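To make the cascaded residual refinement idea concrete, the following is a minimal PyTorch sketch, not the paper's actual architecture: the class names, channel counts, stage depth, and the noise-broadcast scheme are all assumptions. It shows the core pattern the abstract describes: a coarse generator produces an initial panorama from the conditioning inputs, and each subsequent stage predicts only a residual correction on top of the previous output, with a noise code z enabling multi-modal results.

```python
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    """One refinement stage: predicts a residual added to the current panorama."""
    def __init__(self, in_ch, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 3, 3, padding=1),
        )

    def forward(self, pano, cond):
        # Residual learning: the stage outputs a correction, not a full image,
        # so each cascade step only has to sharpen what is already there.
        residual = self.net(torch.cat([pano, cond], dim=1))
        return pano + residual

class CascadedRefiner(nn.Module):
    """Hypothetical cascaded refiner: a coarse generator followed by residual
    stages, conditioned on aerial/semantic features (cond) and a noise code z
    that makes the mapping multi-modal. Channel sizes are illustrative."""
    def __init__(self, cond_ch=4, z_dim=8, stages=3):
        super().__init__()
        in_ch = 3 + cond_ch + z_dim  # panorama + conditioning + broadcast noise
        self.coarse = nn.Sequential(
            nn.Conv2d(cond_ch + z_dim, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )
        self.stages = nn.ModuleList(RefineStage(in_ch) for _ in range(stages))

    def forward(self, cond, z):
        # Broadcast the noise code spatially so every location sees the same
        # style vector; different z draws yield different panoramas.
        z_map = z[:, :, None, None].expand(-1, -1, *cond.shape[2:])
        full_cond = torch.cat([cond, z_map], dim=1)
        pano = self.coarse(full_cond)
        for stage in self.stages:          # progressive refinement
            pano = stage(pano, full_cond)
        return pano

# Usage: cond stacks, e.g., a 3-channel aerial projection and a 1-channel
# semantic map; sampling several z codes gives multi-modal outputs.
cond = torch.randn(1, 4, 128, 512)
z = torch.randn(1, 8)
out = CascadedRefiner()(cond, z)  # -> (1, 3, 128, 512) refined panorama
```

The residual formulation is one plausible reading of "cascaded residual-based" refinement; the adversarial losses, discriminators, and the exact way CRP-GAN fuses the aerial image with the semantic map are detailed in the full paper rather than reproduced here.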