Abstract

In recent years, correlation filtering and deep learning have achieved strong performance in object tracking. Correlation filtering is efficient and runs in real time because its formulation admits a fast closed-form solution in the Fourier domain, but it does not benefit from end-to-end training. Deep learning is effective at learning object representations, yet training deep networks online from one or a few examples is challenging. To address these problems, we propose a deep ensemble object tracking algorithm that fuses temporal and spatial information to improve precision and robustness. The framework comprises four components: feature extraction, a baseline network, a branch network and adaptive ensemble learning. The feature extraction module produces a general object representation. The baseline network integrates feature extraction and a correlation filtering algorithm into a single convolutional neural network for end-to-end training. The branch network consists of a temporal network and a spatial network, which capture the object's temporal and spatial information and further refine the object position. Our algorithm needs only the initial frame to train all networks. Adaptive ensemble learning compensates for the deficiency of object information and improves tracking accuracy. Extensive experiments on tracking benchmark datasets demonstrate that our algorithm performs favourably against state-of-the-art trackers.
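To make the two ingredients of the abstract concrete, the sketch below shows (i) the fast Fourier-domain solution that makes correlation filtering efficient and (ii) a simple adaptive fusion of several branch response maps. This is a minimal illustration, not the authors' implementation: the functions `gaussian_label`, `train_filter`, `respond` and `fuse_responses` and the peak-to-mean weighting rule are assumptions introduced here for clarity.

```python
# Minimal sketch (not the paper's code): a single-channel correlation filter
# solved in closed form in the Fourier domain, plus an assumed adaptive fusion
# of response maps from several branches.
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired response: a 2-D Gaussian peaked at the patch centre."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_filter(feature, label, lam=1e-3):
    """Ridge regression solved element-wise in the frequency domain:
    W = (Y * conj(X)) / (X * conj(X) + lam)."""
    X = np.fft.fft2(feature)
    Y = np.fft.fft2(label)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def respond(filter_hat, feature):
    """Apply the learned filter to a new feature patch (correlation in frequency)."""
    X = np.fft.fft2(feature)
    return np.real(np.fft.ifft2(filter_hat * X))

def fuse_responses(responses):
    """Assumed adaptive fusion rule: weight each branch by the sharpness
    (peak minus mean) of its response map, then normalise the weights."""
    weights = np.array([r.max() - r.mean() for r in responses])
    weights = weights / (weights.sum() + 1e-12)
    return sum(w * r for w, r in zip(weights, responses))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.standard_normal((64, 64))      # stand-in for an extracted feature map
    y = gaussian_label(patch.shape)
    w_hat = train_filter(patch, y)
    # Pretend the baseline, temporal and spatial branches each produced a response map.
    branch_maps = [respond(w_hat, patch + 0.05 * rng.standard_normal(patch.shape))
                   for _ in range(3)]
    fused = fuse_responses(branch_maps)
    dy, dx = np.unravel_index(np.argmax(fused), fused.shape)
    print("estimated peak:", (dy, dx))
```

In the paper's end-to-end setting the feature patches come from a CNN and the filter is a differentiable layer; the closed-form frequency-domain solve above is what keeps that layer fast.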

Highlights

  • Visual object tracking is a fundamental problem in the computer vision field

  • TRACA combines deep learning and correlation filtering; ACFN is a deep learning method based on an attention structure; CFNet and SiamFC are deep learning methods based on a Siamese network; SCT is a correlation filtering method based on an attention structure; Staple and spatially regularized correlation filter (SRDCF) are correlation filtering methods; and convolutional neural network (CNN)-SVM is a deep learning method

  • On the VOT2016 benchmark, we evaluate our tracker against 14 released state-of-the-art trackers, including VITAL [47], efficient convolution operator (ECO)-HC [11], Staple_p [39], SiamRN [39], DNT [48], DeepSRDCF [49], MDNet_N [4], RFD_CF2 [50], SiamAN [39], deepMKCF [51], hierarchical convolutional features (HCFT) [26], and kernelized correlation filter (KCF) [6]



Introduction

Visual object tracking is widely used in many practical systems, such as unmanned aerial vehicles (UAVs) [1], video surveillance [2], and human-computer interaction [3]. The foundation of this problem is building a robust appearance model from extremely limited training data (usually the bounding box in the first frame). The multi-domain network (MDNet) tracker [4] uses video sequences from similar tracking benchmarks to pre-train a deep model and uses object benchmark sequences to fine-tune the learned model online. This approach is prone to overfitting and spends too much time on pre-training. A convolutional neural network (CNN) has been used as an online …
