Abstract

Region-based convolutional networks have shown remarkable ability for object detection in optical remote sensing images. However, standard CNNs are inherently limited in modeling geometric transformations due to the fixed geometric structures of their building modules. To address this, we integrate a module named deformable convolution into the prevailing Faster R-CNN. By adding 2D offsets to the regular sampling grid of the standard convolution, deformable convolution learns to augment the spatial sampling locations from the target task without additional supervision. In our work, a deformable Faster R-CNN is constructed by substituting the standard convolution layers in the last network stage with deformable convolution layers. In addition, top-down and skip connections are adopted to produce a single high-level feature map of fine resolution, on which the predictions are made. To make the model robust to occlusion, a simple yet effective data augmentation technique is proposed for training the network. Experimental results show that our deformable Faster R-CNN improves the mean average precision by a large margin on the SORSI and HRRS datasets.
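The core mechanism described above — adding learned 2D offsets to the regular convolution sampling grid and reading the shifted, fractional locations via bilinear interpolation — can be sketched in a few lines. The following is a minimal single-channel NumPy illustration of the general deformable convolution idea, not the paper's implementation; the function names, shapes, and offset layout are our own assumptions.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (H, W) at fractional location (y, x).
    Samples falling outside the map contribute zero."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < H and 0 <= xx < W:
                val += wy * wx * feat[yy, xx]
    return val

def deform_conv2d_single(feat, weight, offsets):
    """Deformable 2D convolution on a single channel (stride 1).

    feat:    (H, W) input feature map
    weight:  (k, k) kernel
    offsets: (H, W, k*k, 2) learned (dy, dx) offsets, one pair per
             kernel sampling point and output location; in the real
             network these come from an extra convolution branch
    """
    H, W = feat.shape
    k = weight.shape[0]
    pad = k // 2
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            acc = 0.0
            # Regular sampling grid, deformed by the learned offsets.
            for n, (di, dj) in enumerate(
                    (a, b) for a in range(k) for b in range(k)):
                dy, dx = offsets[i, j, n]
                y = i + di - pad + dy
                x = j + dj - pad + dx
                acc += weight[di, dj] * bilinear_sample(feat, y, x)
            out[i, j] = acc
    return out
```

With all offsets set to zero, the sampling grid is regular and the function reduces to a standard (zero-padded) convolution, which makes clear that deformable convolution strictly generalizes the fixed-grid case.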

Highlights

  • Convolutional Neural Networks (CNNs) [1] have achieved flourishing success in visual recognition tasks such as image classification [2], semantic segmentation [3], and object detection [4].

  • To quantitatively evaluate the proposed deformable Faster RCNN with Transfer Connection Block (TCB), we compared its Average Precision (AP) values against four state-of-the-art CNN-based methods: (1) a rotation-invariant CNN (RICNN) model, which captures rotation-invariant information through a rotation-invariant layer and other fine-tuned layers; (2) the SSD model with an input image size of 512 × 512 pixels; (3) the R-P-Faster RCNN [30] object detection framework; and (4) deformable R-FCN with aspect-ratio-constrained Non-Maximum Suppression (NMS).

  • The proposed deformable Faster RCNN with TCB, fine-tuned from a ResNet-50 ImageNet pre-trained model, obtains the best mean AP value of 84.4% among all the object detection methods, achieving the best AP values for most classes, except baseball diamond, harbor, and bridge.


Summary

Introduction

Convolutional Neural Networks (CNNs) [1] have achieved flourishing success in visual recognition tasks such as image classification [2], semantic segmentation [3], and object detection [4]. Modeling geometric variations or transformations in object scale, pose, viewpoint, and part deformation is a key challenge in optical remote sensing visual recognition. In the past few decades, various methods have been developed for detecting different types of objects in satellite and aerial images, such as buildings [5], storage tanks [6], vehicles [7], and airplanes [8]. They can be divided into four main categories: template matching-based methods, knowledge-based methods, OBIA-based methods, and machine learning-based methods. Many recent approaches formulate object detection as feature extraction and classification problems and have achieved significant improvements.
