Abstract

Urban scene classification is an active area of computer vision. The task involves classifying a scene from a pair of aerial-view and ground-view images. Existing approaches include single-view and multiview methods, and multiview approaches have been shown to be more robust than single-view ones. However, most existing multiview approaches neglect the disparity in resolution between the two views: the aerial-view images are captured with sophisticated high-resolution remote sensing devices, whereas the ground-view images are captured from closer perspectives at lower resolution. This paper proposes a multiview scene classification (MuSC) model that accounts for the resolution disparity between the two views. MuSC introduces a Fourier convolution network (FCN) that is robust to variation in resolution across the cross-view images. The FCN is designed to extract local features (in the spatial domain) and global features (in the spectral domain). The proposed MuSC has a two-stage classifier. The first stage trains a discriminative view-specific network for each view and classifies each view separately. In addition, the outputs of the view-specific networks are projected into a unified subspace, and mutual agreement between them is encouraged through contrastive learning. The second stage integrates the predictions from the view-specific networks and trains a unified classifier for the final prediction. This integration encourages cross-view complementarity. MuSC is evaluated on two datasets, AiRound and CV-BrCT, in several experiments with different settings. The results demonstrate that MuSC outperforms existing state-of-the-art models.
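To make the distinction between local (spatial-domain) and global (spectral-domain) features concrete, the following is a minimal NumPy sketch of a single-channel block that sums a spatial convolution branch and a Fourier branch. It is an illustration under stated assumptions, not the FCN used in MuSC: the function names `local_branch`, `global_branch`, and `fourier_block`, the fixed 3x3 kernel, the per-frequency `gain`, and the ReLU are all hypothetical choices made here for clarity.

```python
import numpy as np


def local_branch(x, kernel):
    """Spatial-domain branch: zero-padded sliding-window filter.

    Each output pixel depends only on a small neighbourhood of the input.
    """
    h, w = x.shape
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out


def global_branch(x, gain):
    """Spectral-domain branch: rescale each frequency, then invert.

    A pointwise change in the frequency domain affects every spatial
    position, so each output pixel depends on the whole input image.
    `gain` must have shape (h, w // 2 + 1), matching np.fft.rfft2(x).
    """
    spectrum = np.fft.rfft2(x)
    return np.fft.irfft2(spectrum * gain, s=x.shape)


def fourier_block(x, kernel, gain):
    """Hypothetical block: sum of both branches followed by a ReLU."""
    return np.maximum(local_branch(x, kernel) + global_branch(x, gain), 0.0)


if __name__ == "__main__":
    # A single bright pixel makes the receptive fields easy to see.
    image = np.zeros((8, 8))
    image[4, 4] = 1.0

    # Keep only the zero-frequency (mean) component in the global branch.
    dc_only = np.zeros((8, 5))
    dc_only[0, 0] = 1.0

    print("pixels touched by local branch:",
          np.count_nonzero(local_branch(image, np.ones((3, 3)))))
    print("pixels touched by global branch:",
          np.count_nonzero(np.abs(global_branch(image, dc_only)) > 1e-12))
```

In this sketch the local branch spreads the single bright pixel over its 3x3 neighbourhood only, while the global branch spreads it over the entire image. In a trained network the kernel and the spectral weights would be learned rather than fixed.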

