Abstract

The multi-label classification problem in Unmanned Aerial Vehicle (UAV) images is particularly challenging compared to single-label classification due to its combinatorial nature. To tackle this issue, we propose in this paper a deep learning approach based on an encoder-decoder neural network architecture with channel and spatial attention mechanisms. Specifically, the encoder module, which is based on a pre-trained convolutional neural network (CNN), has the task of transforming the input image into a set of feature maps using a suitable feature combination. To further improve the feature representation, this module incorporates a squeeze-and-excitation (SE) layer for modelling the interdependencies between the channels of the feature maps. The decoder module, which is based on a long short-term memory (LSTM) network, has the task of generating, in a sequential manner, the classes present in the image. At each time step, it predicts the next class label by aligning its hidden state with the corresponding region in the image by means of an adaptive spatial attention mechanism. The experiments carried out on two UAV datasets with a spatial resolution of 2 cm show that our method is promising in predicting the labels present in the image while attending to the relevant objects in the image. Additionally, it provides better classification results than state-of-the-art methods.
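The squeeze-and-excitation recalibration described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function and weight names (`squeeze_excitation`, `w1`, `w2`) and the reduction ratio are assumptions for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excitation(fmaps, w1, w2):
    """Squeeze-and-excitation over feature maps of shape (C, H, W).

    w1: (C//r, C) weights of the reduction FC layer (r = reduction ratio)
    w2: (C, C//r) weights of the expansion FC layer
    """
    # Squeeze: global average pooling collapses each H x W map to one scalar.
    z = fmaps.mean(axis=(1, 2))                 # shape (C,)
    # Excitation: a bottleneck MLP models channel interdependencies.
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # shape (C,), values in (0, 1)
    # Scale: reweight each channel of the input by its learned importance.
    return fmaps * s[:, None, None]
```

The output has the same shape as the input; each channel is simply rescaled by a learned scalar in (0, 1), which is how the SE layer emphasizes informative channels and suppresses less useful ones.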

Highlights

  • Unmanned aerial vehicles (UAVs), commonly known as drones, have proven effective in collecting images with extremely high spatial detail over inaccessible areas and zones of limited coverage, thanks to their small size and fast deployment

  • We propose an alternative solution based on an encoder-decoder neural network architecture with channel and spatial attention mechanisms

  • We evaluated the proposed attention network on two UAV datasets acquired over the Faculty of Science of the University of Trento (Italy) and near the city of Civezzano (Italy) in October 2011 and 2012 by means of a UAV equipped with imaging sensors spanning the visible range (Figure 4)
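The adaptive spatial attention used by the decoder to align its hidden state with image regions can be sketched as a standard soft-attention step. This is an illustrative sketch only: the projection weights `w_f`, `w_h`, `v` and the function name are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention(features, hidden, w_f, w_h, v):
    """Soft spatial attention over flattened image regions.

    features: (N, D) encoder feature vectors, one per region (N = H*W)
    hidden:   (M,)   current decoder (LSTM) hidden state
    w_f (A, D), w_h (A, M), v (A,): illustrative projection weights
    """
    # Score each region against the decoder state (additive attention).
    scores = np.tanh(features @ w_f.T + (w_h @ hidden)) @ v   # (N,)
    # Normalize scores into attention weights over regions.
    alpha = softmax(scores)                                   # (N,), sums to 1
    # Context vector: attention-weighted summary of the regions.
    context = alpha @ features                                # (D,)
    return context, alpha
```

At each time step the decoder would consume `context` to predict the next class label, and `alpha` indicates which image regions were attended.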

Introduction

The increasing adoption of unmanned aerial vehicles (UAVs), commonly known as drones, has demonstrated their effectiveness in collecting images with extremely high spatial detail over inaccessible areas and zones of limited coverage, thanks to their small size and fast deployment.

As the network goes deeper, it uses high-dimensional representations through two inception modules of type C, referred to as 2×Inception Module C (Figure 1). This network includes several improvements over the original GoogLeNet (inception-v1), which was the winner of ILSVRC14 (ImageNet Large Scale Visual Recognition Challenge). These improvements include: 1) the RMSProp optimizer; 2) factorized 7 × 7 convolutions; 3) batch normalization in the auxiliary classifiers; and 4) label smoothing, a regularizing term added to the loss that prevents the network from becoming too confident about a class and thereby reduces overfitting.
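The label-smoothing regularizer mentioned above can be sketched concretely. This is a minimal illustration under standard assumptions (uniform smoothing with `eps = 0.1`, as commonly used with inception-v3); the helper names are illustrative.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: soften hard 0/1 targets so the network is never
    pushed toward fully confident predictions. Each true class keeps
    probability 1 - eps + eps/K; every other class receives eps/K."""
    k = one_hot.shape[-1]                          # number of classes K
    return one_hot * (1.0 - eps) + eps / k

def cross_entropy(probs, targets):
    """Mean cross-entropy between predicted probabilities and (soft) targets."""
    return float(-(targets * np.log(probs)).sum(axis=-1).mean())
```

For example, with 4 classes and `eps = 0.1`, a hard target `[1, 0, 0, 0]` becomes `[0.925, 0.025, 0.025, 0.025]`: the loss still favors the true class but never rewards a fully saturated prediction.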
