Learning Enhanced Feature Responses for Visual Object Tracking

Runqing Zhang, Chunxiao Fan, Yue Ming

doi:10.1155/2022/1241687

Abstract

Visual object tracking is an important topic in computer vision and has successfully utilized pretrained convolutional neural networks, such as VGG and ResNet. However, the features extracted by these pretrained models are high dimensional, and the redundant feature channels reduce the precision of target localization and scale estimation, leading to tracking drift. In this paper, a novel visual object tracking method, called learning enhanced feature responses tracking (LEFRT), is proposed, which adopts target-specific features to enhance the target localization and scale estimation responses. First, a channel attention module, called the target-specific network (TSNet), is presented to reduce the redundant feature channels. Second, the scale estimation network (SCENet) is introduced to extract spatial structural features and generate a more precise response for scale estimation. Extensive experiments on six tracking benchmarks, namely LaSOT, GOT-10k, TrackingNet, OTB-2013, OTB-2015, and TC-128, demonstrate that the proposed algorithm can effectively improve the precision and speed of visual object tracking. LEFRT achieves 90.4% precision and a 71.2% success rate on the OTB-2015 dataset, improving on tracking methods based on pretrained features.

Highlights

Visual object tracking is one of the fundamental tasks in computer vision, widely used in the civil and military fields, such as image segmentation [1], intelligent transportation [2], object detection [3], and human-computer interaction [4]. Recently, pretrained deep features have brought state-of-the-art performance to existing trackers, effectively separating foreground objects from the background.
At the same time, tracking methods based on correlation filters utilize convolutional features and a priori scale coefficients to estimate the shape of the targets. The a priori scale coefficients are set as discrete constant parameters, which limits the achievable precision. Therefore, it is of great importance to exploit more compact features to represent specific targets.
We compare the precision score and success rate obtained by our learning enhanced feature responses tracking (LEFRT) with those of several state-of-the-art tracking methods, including SiamFC++ [38], ASRCF [24], ARCF [39], UDT [20], ECO [40], CNN-SVM [41], fDSST [13], ATTF [21], TADT [42], MLT [43], CFNet [44], SiamFC [16], Siam-tri [45], EACOFT [22], and the method of Dinesh et al. [23].
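The two metrics named above can be illustrated with a minimal sketch. It assumes the definitions commonly used with OTB-style benchmarks, which this summary does not spell out: the precision score is the fraction of frames whose center location error is within 20 pixels, and the success rate is the area under the curve of frames whose overlap (IoU) exceeds a threshold swept from 0 to 1. The function names, the `(x, y, w, h)` box layout, and the 21-point threshold grid are illustrative choices, not taken from the paper.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (N, 4) as x, y, w, h."""
    pred_center = pred[:, :2] + pred[:, 2:] / 2.0
    gt_center = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_center - gt_center, axis=1)

def iou(pred, gt):
    """Per-frame intersection over union for (N, 4) boxes as x, y, w, h."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_score(pred, gt, threshold=20.0):
    """Fraction of frames with center error within `threshold` pixels (assumed: 20)."""
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt, thresholds=None):
    """Mean over thresholds of the fraction of frames with IoU above the threshold."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)  # assumed threshold grid
    overlaps = iou(pred, gt)
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))
```

Both functions take per-frame predicted and ground-truth boxes for one sequence; benchmark toolkits typically average the results over all sequences.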

Summary

Introduction

Visual object tracking is one of the fundamental tasks in computer vision, widely used in the civil and military fields, such as image segmentation [1], intelligent transportation [2], object detection [3], and human-computer interaction [4]. Tracking methods based on correlation filters utilize convolutional features and a priori scale coefficients to estimate the shape of the targets. We propose a novel visual object tracking algorithm called learning enhanced feature responses tracking (LEFRT), which consists of the TSNet and the SCENet. The TSNet reduces the redundant feature channels for tracking methods based on correlation filters and pretrained features. The main contributions are as follows.
(1) We propose the TSNet to generate channel attention and select the effective channels for arbitrary targets, significantly reducing the redundant feature channels and locating the specific target more precisely.
(2) A 2D RNN feature is utilized in the SCENet to represent the spatial structure of the target for scale estimation; it describes the spatial relationship and enhances the response at the target's boundary.
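The channel-selection idea behind contribution (1) can be illustrated with a minimal sketch. This is not the TSNet itself, which learns its channel attention; the stand-in below uses a fixed heuristic (the share of each channel's activation that falls inside the target region) purely to show how attention weights can prune redundant channels. The function names, the binary target mask, and the `keep` parameter are hypothetical.

```python
import numpy as np

def channel_attention(features, target_mask):
    """Score each channel by the share of its activation inside the target region.

    features: (C, H, W) non-negative activations from a pretrained backbone.
    target_mask: (H, W) array with 1 inside the target box and 0 elsewhere.
    Heuristic stand-in for a learned attention module (assumption).
    """
    inside = (features * target_mask).sum(axis=(1, 2))
    total = features.sum(axis=(1, 2)) + 1e-8  # avoid division by zero
    return inside / total

def select_channels(features, weights, keep):
    """Keep the `keep` highest-weighted channels and scale them by their weights."""
    top = np.sort(np.argsort(weights)[::-1][:keep])
    return features[top] * weights[top][:, None, None], top

# Toy example: channels 0 and 3 respond to the target, 1 and 2 do not.
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
feats = np.stack([mask, 1.0 - mask, np.ones((8, 8)), 2.0 * mask])
weights = channel_attention(feats, mask)
compact, kept = select_channels(feats, weights, keep=2)
```

In this toy case the background-only and uniform channels receive low weights and are dropped, so the downstream correlation filter would operate on a smaller, target-focused feature stack.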

Related Work

Experiments

Methods