Abstract

Object detection in remote-sensing images (RSIs) has long been a vibrant research topic in the remote-sensing community. Recently, deep-convolutional-neural-network (CNN)-based methods, including region-CNN-based and You-Only-Look-Once-based methods, have become the de-facto standard for RSI object detection. CNNs excel at local feature extraction but have limitations in capturing global features, whereas the attention-based Transformer can model long-range relationships within an RSI. Therefore, the Transformer for Remote-Sensing Object detection (TRD) is investigated in this study. Specifically, the proposed TRD combines a CNN with a multi-layer Transformer composed of encoders and decoders. To detect objects from RSIs, a modified Transformer is designed to aggregate features of global spatial positions on multiple scales and to model the interactions between pairwise instances. Then, because the source data set (e.g., ImageNet) and the target data set (i.e., the RSI data set) differ considerably, the TRD with a transferring CNN (T-TRD) based on the attention mechanism is proposed to adapt the pre-trained model and reduce this difference for better RSI object detection. Because training a Transformer requires abundant, well-annotated samples, while the number of training samples for RSI object detection is usually limited, data augmentation is combined with the Transformer to avoid overfitting and improve detection performance. The proposed T-TRD with data augmentation (T-TRD-DA) is tested on two widely used data sets (i.e., NWPU VHR-10 and DIOR), and the experimental results show that the proposed models achieve competitive performance, i.e., mean average precision (×100) of 87.9 and 66.8, up to 5.9 and 2.4 higher than the comparison methods on the NWPU VHR-10 and DIOR data sets, respectively, demonstrating that Transformer-based methods open a new window for RSI object detection.
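To make the architectural idea concrete, the following is a minimal, illustrative sketch of the general CNN-plus-Transformer detection pattern the abstract describes: a CNN backbone, a Transformer encoder-decoder, and class/box prediction heads, in a DETR-style arrangement. It is an assumption-laden sketch, not the authors' TRD implementation; all class and parameter names (e.g., CNNTransformerDetector, num_queries) are hypothetical, and positional encodings, multi-scale feature aggregation, and the attention-based transferring CNN are omitted for brevity.

```python
# Minimal sketch (assumption): a DETR-style CNN + Transformer detector, not the paper's TRD.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class CNNTransformerDetector(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, num_heads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # CNN backbone: keep the convolutional stages, drop pooling and the classifier head.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Project CNN feature maps (2048 channels for ResNet-50) to the Transformer width.
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        # Transformer encoder-decoder models long-range (global) spatial relations.
        # Positional encodings are omitted here for brevity.
        self.transformer = nn.Transformer(d_model=hidden_dim, nhead=num_heads,
                                          num_encoder_layers=num_encoder_layers,
                                          num_decoder_layers=num_decoder_layers,
                                          batch_first=True)
        # Learned object queries: each query attends to the image and yields one prediction.
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # Prediction heads: class logits (+1 for "no object") and normalized boxes (cx, cy, w, h).
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):                            # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))    # (B, hidden_dim, H', W')
        b = feats.shape[0]
        src = feats.flatten(2).permute(0, 2, 1)           # (B, H'*W', hidden_dim) token sequence
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.transformer(src, queries)               # (B, num_queries, hidden_dim)
        return self.class_head(hs), self.box_head(hs).sigmoid()


# Usage example with a dummy image batch (10 classes, as in NWPU VHR-10).
if __name__ == "__main__":
    model = CNNTransformerDetector(num_classes=10)
    logits, boxes = model(torch.randn(2, 3, 512, 512))
    print(logits.shape, boxes.shape)                      # (2, 100, 11) and (2, 100, 4)
```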

Highlights

  • Object detection in remote-sensing images (RSIs) is used to answer one of the most basic questions in the remote-sensing (RS) community: What and where are the objects in the RSIs? In general, the objective of object detection is to build models to localize and recognize different ground objects of interest in high-resolution RSIs [1]

  • The proposed T-TRD-DA (the Transformer for Remote-Sensing Object detection with transferring CNN and data augmentation) exhibits satisfactory detection capability on the large-scale, challenging data set

  • Transformer-based frameworks were explored for RSI object detection


Summary

Introduction

Object detection in remote-sensing images (RSIs) is used to answer one of the most basic questions in the remote-sensing (RS) community: What and where are the objects (such as a ship, vehicle, or aircraft) in the RSIs? In general, the objective of object detection is to build models to localize and recognize different ground objects of interest in high-resolution RSIs [1]. Because object detection is a fundamental task for the interpretation of high-resolution RSIs, a great number of methods have been proposed to handle the issue of RSI object detection in the last decade [2]. The traditional RSI object-detection methods focus on constructing effective features for objects of interest and training a classifier from a set of annotated RSIs. They usually acquire object regions with sliding windows and try to recognize each region, as sketched below.
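As a point of reference for that traditional pipeline, the following is a minimal sketch of sliding-window region extraction followed by per-window classification. The function names (sliding_window_regions, detect, classify_fn) are hypothetical placeholders, the image is assumed to be a NumPy array, and the code is not from the paper.

```python
# Minimal sketch (assumption): classic sliding-window detection over an H x W x C image array.
import numpy as np


def sliding_window_regions(image, window=(64, 64), stride=32):
    """Yield (x, y, crop) for every window position over the image."""
    h, w = image.shape[:2]
    win_h, win_w = window
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]


def detect(image, classify_fn, score_threshold=0.5):
    """Run a per-window classifier and keep windows scoring above the threshold."""
    detections = []
    for x, y, crop in sliding_window_regions(image):
        label, score = classify_fn(crop)   # classify_fn stands in for a trained classifier
        if score >= score_threshold:
            detections.append((x, y, *crop.shape[:2], label, score))
    return detections


# Usage example with a dummy image and a placeholder classifier.
if __name__ == "__main__":
    dummy = np.zeros((256, 256, 3), dtype=np.uint8)
    fake_classifier = lambda crop: ("airplane", 0.9)
    print(len(detect(dummy, fake_classifier)))
```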

