Abstract

Generic image classification has been widely studied over the past decade. However, for bird's-eye-view aerial images, scene classification remains challenging due to dramatic variations in scale and object size. Existing methods usually learn aerial scene representations with convolutional neural networks (CNNs), which focus on the local responses of an image. In contrast, recently developed vision transformers (ViTs) can learn stronger global representations for aerial scenes, but they struggle to highlight the key objects in a scene because of those same size and scale variations. To address this challenge, we propose a local-global interactive vision transformer (LG-ViT) for aerial scene classification. It builds on a deliberately designed local-global feature interactive learning scheme that jointly exploits local and global feature representations. To realize this scheme in an end-to-end manner, LG-ViT consists of three key components: local-global feature extraction, local-global feature interaction, and local-global semantic constraints. Extensive experiments on three aerial scene classification benchmarks, UCM, AID, and NWPU, demonstrate the effectiveness of LG-ViT against state-of-the-art methods. The contribution of each component and the generalization capability of the model are also validated.
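
As a rough illustration of how such a local-global interactive scheme might be organized, the sketch below pairs a small CNN branch (local responses) with a transformer branch (global context), fuses them with cross-attention, and supervises each branch with its own classification head as a stand-in for the local-global semantic constraints. The interaction mechanism, backbone choices, and all names (LocalGlobalSketch, local_branch, and so on) are assumptions made for illustration only; they are not taken from the paper.

# Minimal sketch of a local-global interactive classifier (hypothetical).
# The cross-attention interaction, the backbones, and all names below are
# illustrative assumptions, not the LG-ViT architecture from the paper.
import torch
import torch.nn as nn


class LocalGlobalSketch(nn.Module):
    def __init__(self, num_classes: int, dim: int = 256):
        super().__init__()
        # Local branch: a small CNN capturing local responses.
        self.local_branch = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Global branch: patch embedding + transformer encoder for global context.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                   batch_first=True)
        self.global_branch = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Interaction: global tokens attend to local tokens (assumed cross-attention).
        self.interact = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Separate heads so the local and global streams can each be supervised.
        self.local_head = nn.Linear(dim, num_classes)
        self.global_head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor):
        local = self.local_branch(x)                        # (B, dim, H', W')
        local_tokens = local.flatten(2).transpose(1, 2)     # (B, N_local, dim)
        global_tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        global_tokens = self.global_branch(global_tokens)   # (B, N_global, dim)
        # Cross-attention: enrich global tokens with local detail.
        fused, _ = self.interact(global_tokens, local_tokens, local_tokens)
        logits_global = self.global_head(fused.mean(dim=1))
        logits_local = self.local_head(local_tokens.mean(dim=1))
        return logits_local, logits_global


if __name__ == "__main__":
    model = LocalGlobalSketch(num_classes=45)   # e.g. NWPU-RESISC45 has 45 classes
    images = torch.randn(2, 3, 224, 224)
    out_local, out_global = model(images)
    print(out_local.shape, out_global.shape)    # torch.Size([2, 45]) twice

In practice the two logits would be combined (e.g. averaged) at inference time, while during training each head receives its own loss so that both the local and the global representation are constrained by the scene label.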
