Abstract

The recent Vision Transformer (ViT) has a stronger contextual feature representation capability than existing convolutional neural networks, and thus has the potential to depict remote sensing scenes, which usually exhibit more complicated object distributions and spatial arrangements than ground image scenes. However, recent studies indicate that while ViT learns global features, it tends to ignore key local features, which poses a bottleneck for understanding remote sensing scenes. In this letter, we tackle this challenge by proposing a novel Multi-Instance Vision Transformer (MITformer). Its originality mainly lies in the classic multiple instance learning (MIL) formulation, where each image patch embedded by ViT is regarded as an instance and each image as a bag. The benefit of designing ViT under the MIL formulation is straightforward: it highlights the feature responses of key local regions in remote sensing scenes. Moreover, to enhance the propagation of local features, an attention-based multilayer perceptron (MLP) head is embedded at the end of each encoder unit. Last but not least, to minimize potential semantic prediction differences between the classic ViT head and our MIL head, a semantic consistency loss is designed. Experiments on three remote sensing scene classification benchmarks show that the proposed MITformer outperforms existing state-of-the-art methods and validate the effectiveness of each of its components.
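To make the MIL formulation concrete, the sketch below illustrates one common way such a design can be realized: gated attention pooling over ViT patch tokens (treated as instances of an image bag) and a KL-based consistency term between the bag-level prediction and the standard [CLS]-token prediction. This is a minimal illustration under assumed design choices, not the authors' implementation; the names MILHead and consistency_loss, the gated-attention pooling, and all hyperparameters are hypothetical.

```python
# Minimal sketch (not the authors' code): attention-based MIL pooling over
# ViT patch tokens plus a consistency loss between the two prediction heads.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILHead(nn.Module):
    """Treats each patch token as an instance and the image as a bag,
    aggregating instances with gated attention pooling."""
    def __init__(self, dim: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.attn_v = nn.Linear(dim, hidden)     # instance "value" branch
        self.attn_u = nn.Linear(dim, hidden)     # gating branch
        self.attn_w = nn.Linear(hidden, 1)       # scalar attention score
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), excluding the [CLS] token
        scores = self.attn_w(torch.tanh(self.attn_v(patch_tokens)) *
                             torch.sigmoid(self.attn_u(patch_tokens)))
        alpha = scores.softmax(dim=1)            # instance attention weights
        bag = (alpha * patch_tokens).sum(dim=1)  # weighted bag embedding
        return self.classifier(bag)              # bag-level class logits

def consistency_loss(logits_vit: torch.Tensor, logits_mil: torch.Tensor) -> torch.Tensor:
    """Penalizes divergence between the ViT [CLS]-head and MIL-head predictions."""
    log_p_vit = F.log_softmax(logits_vit, dim=-1)
    p_mil = F.softmax(logits_mil, dim=-1)
    return F.kl_div(log_p_vit, p_mil, reduction="batchmean")

# Usage with dummy tensors standing in for a ViT encoder's outputs:
tokens = torch.randn(2, 196, 768)                # 14x14 patch tokens, dim 768
cls_logits = torch.randn(2, 45)                  # e.g. 45 scene classes
mil_head = MILHead(dim=768, num_classes=45)
mil_logits = mil_head(tokens)
loss = consistency_loss(cls_logits, mil_logits)
print(loss.item())
```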
