Visual object tracking is an important topic in computer vision, and trackers have successfully used pretrained convolutional neural networks such as VGG and ResNet. However, the features extracted by these pretrained models are high-dimensional, and redundant feature channels reduce the precision of target localization and scale estimation, leading to tracking drift. In this paper, a novel visual object tracking method, called learning enhanced feature responses tracking (LEFRT), is proposed; it adopts target-specific features to enhance the target localization and scale estimation responses. First, a channel attention module, called the target-specific network (TSNet), is presented to reduce redundant feature channels. Second, a scale estimation network (SCENet) is introduced to extract spatial structural features and generate a more precise response for scale estimation. Extensive experiments on six tracking benchmarks (LaSOT, GOT-10k, TrackingNet, OTB-2013, OTB-2015, and TC-128) demonstrate that the proposed algorithm effectively improves the precision and speed of visual object tracking. LEFRT achieves 90.4% precision and a 71.2% success rate on the OTB-2015 dataset, improving on tracking methods based on pretrained features.