Abstract

Feature extraction in remote sensing is a challenging yet crucial step in scene classification because of cloud cover and overlapping edges in the imagery. Many architectures have been used solely as feature-extraction backbones for complex computer vision tasks such as object detection and semantic segmentation. Although the remote sensing literature has compared deep learning models for scene classification, a systematic comparison between transformer-based and convolution-based architectures has been missing. This work therefore comprehensively analyses different deep learning architectures on multiple scene classification datasets to understand the learned features and weigh the advantages of the functional connections used in different convolutional neural networks. Five open-source benchmark datasets are used: UC Merced Land Use, WHU-RS19, Optimal-31, RSI-CB256, and MLRSNet. Feature extraction for remote sensing natural scene classification is performed with ImageNet-22k pre-trained weights of convolution-based architectures (VGG-16, ResNet50, EfficientNetB3, and ConvNeXt) and transformer-based architectures (Vision Transformer (ViT) and Swin Transformer). The networks are then fine-tuned with the LinBnDrop block from the fastai framework before scene classification through a softmax layer. Our work establishes a new benchmark for all datasets on a 90:10 train-test split. Guidance on choosing an architecture based on the available data and the target application is also provided. The analysis of the 42 experiments conducted in this work will help the research community understand these scene classification datasets and gain better insight into fine-tuning.
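
As a concrete illustration of the fine-tuning setup described above, the sketch below builds a classification head with fastai's LinBnDrop block on top of features from a pre-trained backbone. This is a minimal sketch, not the authors' code: the feature dimension, hidden width, and dropout rates are illustrative assumptions, and 21 classes corresponds to the UC Merced Land Use dataset.

    # Minimal sketch of a LinBnDrop fine-tuning head (fastai) applied to
    # backbone features; the sizes below are assumptions, not paper values.
    import torch
    import torch.nn as nn
    from fastai.layers import LinBnDrop

    num_features = 768  # assumed backbone embedding size (e.g., ViT-Base)
    num_classes = 21    # UC Merced Land Use has 21 scene classes

    # LinBnDrop chains BatchNorm1d -> Dropout -> Linear (+ optional activation)
    head = nn.Sequential(
        LinBnDrop(num_features, 512, p=0.25, act=nn.ReLU()),  # hidden layer
        LinBnDrop(512, num_classes, p=0.5),                   # class logits
        nn.Softmax(dim=1),                                    # class probabilities
    )

    features = torch.randn(8, num_features)  # stand-in for backbone output
    probs = head(features)                   # shape (8, 21); rows sum to 1

In practice the softmax would usually be folded into a cross-entropy loss during training; it is kept explicit here only to mirror the softmax classification layer named in the abstract.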
