Abstract

Target tracking in low-altitude Unmanned Aerial Vehicle (UAV) videos faces many technical challenges due to the relatively small sizes and varied orientations of the objects and the diversity of scenes, so tracking performance is still not satisfactory. In this paper, we propose a real-time single-target tracking method with multiple Region Proposal Networks (RPNs) and a Distance-Intersection-over-Union (Distance-IoU) Discriminative Network (DIDNet), namely MultiRPN-DIDNet, in which ResNet50 is used as the backbone network for feature extraction. Firstly, an instance-based RPN suited to the target tracking task is constructed under the framework of a Siamese Neural Network. Each RPN performs bounding box regression and classification, and a channel attention mechanism is integrated to improve the representational capability of the deep features. The RPNs built on Block 2, Block 3 and Block 4 of ResNet50 output their own Regression (Reg) coefficients and Classification (Cls) scores, which are weighted and then fused to determine high-quality region proposals. Secondly, a DIDNet is designed to finely correct the candidate target’s bounding box through the fusion of multi-layer features; it is trained with the Distance-IoU loss. Experimental results on the public UAV20L and DTB70 datasets show that, compared with state-of-the-art UAV trackers, the proposed MultiRPN-DIDNet obtains better tracking performance with fewer region proposals and correction iterations. The tracking speed reaches 33.9 frames per second (FPS), which meets the requirements of real-time tracking tasks.
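The Distance-IoU loss used to train DIDNet penalizes the normalized distance between the centers of the predicted and ground-truth boxes in addition to the IoU term. A minimal sketch of how it can be computed for two axis-aligned boxes follows; the `[x1, y1, x2, y2]` box format and the function name are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def diou_loss(box_p, box_g):
    """Distance-IoU loss: 1 - IoU + rho^2 / c^2, where rho is the distance
    between box centers and c is the diagonal of the smallest enclosing box.
    Boxes are [x1, y1, x2, y2]."""
    # Intersection area
    ix1 = max(box_p[0], box_g[0]); iy1 = max(box_p[1], box_g[1])
    ix2 = min(box_p[2], box_g[2]); iy2 = min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area and IoU
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter)
    # Squared distance between box centers
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    # Squared diagonal of the smallest box enclosing both
    ex1 = min(box_p[0], box_g[0]); ey1 = min(box_p[1], box_g[1])
    ex2 = max(box_p[2], box_g[2]); ey2 = max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return 1.0 - iou + rho2 / c2
```

Unlike a plain IoU loss, the distance term still provides a gradient when the two boxes do not overlap, which is what makes it suitable for iteratively correcting a candidate bounding box.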

Highlights

  • In recent years, due to their many outstanding advantages in performance and cost, unmanned aerial vehicles (UAVs) have increasingly been deployed in many fields, such as security monitoring, disaster relief, agriculture, military equipment, sports and entertainment.

  • The sequence length varies from 1717 to 5527 frames. These sequences are labeled with 12 different attributes, namely Scale Variation (SV), Aspect Ratio Change (ARC), Camera Motion (CM), Full Occlusion (FOC), Illumination Variation (IV), Fast Motion (FM), Low Resolution (LR), Similar Object (SOB), Out-of-View (OV), Partial Occlusion (POC), Background Clutter (BC), and Viewpoint Change (VC).

  • In order to verify the performance of the proposed method more comprehensively, we display the tracking results obtained by the top five trackers ranked by Area Under Curve (AUC).


Introduction

Due to their many outstanding advantages in performance and cost, unmanned aerial vehicles (UAVs) have increasingly been deployed in many fields, such as security monitoring, disaster relief, agriculture, military equipment, sports and entertainment. A huge amount of visual data has been produced, and the demand for intelligent processing of UAV videos has increased significantly. Due to the release of new benchmark datasets and improved methodologies, single-target tracking has become a research hotspot, and the related work has made considerable advances. Existing trackers can be roughly divided into trackers based on Discriminative Correlation Filter (DCF) and trackers based on deep learning. Minimum Output Sum of Squared Error (MOSSE) is one of the most representative DCF-based trackers [1]. This kind of tracker has a fast tracking speed and is easy to transplant to embedded hardware platforms for real-time processing, but its tracking accuracy is relatively low, making it difficult to meet high-accuracy tracking requirements.
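For background, the MOSSE tracker mentioned above learns a correlation filter in closed form in the frequency domain: the filter is the ratio of the cross-spectrum between training patches and a desired Gaussian-shaped response to the auto-spectrum of the patches. A minimal sketch follows; the function names and the single-filter, grayscale formulation are illustrative assumptions, not the original implementation:

```python
import numpy as np

def mosse_filter(patches, desired_response, eps=1e-5):
    """Closed-form MOSSE correlation filter.
    patches: list of 2-D training patches (same shape);
    desired_response: 2-D array, e.g. a Gaussian peaked at the target center."""
    G = np.fft.fft2(desired_response)
    A = np.zeros_like(G)  # accumulated cross-spectrum
    B = np.zeros_like(G)  # accumulated auto-spectrum
    for p in patches:
        F = np.fft.fft2(p)
        A += G * np.conj(F)
        B += F * np.conj(F)
    return A / (B + eps)  # filter kept in the frequency domain

def correlation_response(H, patch):
    """Response map of a new patch under filter H; the peak gives the
    estimated target location."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
```

Because both training and detection reduce to element-wise operations on FFTs, the filter is extremely cheap to evaluate, which explains the high frame rates of DCF trackers noted in the text.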
