Abstract
Siamese network based visual tracking has drawn considerable attention recently due to its balanced accuracy and speed. Methods of this type mostly train a relatively shallow twin network offline and measure similarity online with a cross-correlation operation between the feature maps produced by the last convolutional layer for the target and search regions, in order to locate the object. Nevertheless, a single feature map extracted from the last layer of a shallow network is insufficient to describe the target appearance and is sensitive to distractors, which can mislead the similarity response map and make the tracker drift easily. To enhance tracking accuracy and robustness while maintaining real-time speed, this paper makes three improvements to the above tracking paradigm: a reformed backbone network, fusion of hierarchical features, and a channel attention mechanism. Firstly, we introduce a modified, deeper VGG16 backbone network that extracts more powerful features and helps distinguish the target from distractors. Secondly, we fuse diverse features extracted from deep and shallow layers to exploit both the semantic and the spatial information of the target. Thirdly, we incorporate a novel lightweight residual channel attention mechanism into the backbone network, which widens the weight gap between channels and helps the network pay more attention to dominant features. Extensive experimental results on the OTB100 and VOT2018 benchmarks demonstrate that our tracker outperforms several state-of-the-art methods in accuracy and efficiency in real-time scenarios.
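As a rough illustration of the cross-correlation step described above, the following PyTorch sketch slides the template (exemplar) feature map over the search-region feature map to produce a single-channel similarity response map. The tensor shapes and the function name are assumptions made for demonstration, not the paper's exact implementation.

```python
# Minimal sketch of Siamese cross-correlation; shapes and names are illustrative.
import torch
import torch.nn.functional as F

def cross_correlation(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Correlate the template feature map with the search-region feature map.

    template_feat: (B, C, Ht, Wt) features of the exemplar (target) patch.
    search_feat:   (B, C, Hs, Ws) features of the search region.
    Returns a (B, 1, Hs-Ht+1, Ws-Wt+1) similarity response map whose peak
    indicates the most likely target location.
    """
    b, c, h, w = template_feat.shape
    # Treat each template in the batch as a separate convolution kernel by
    # folding the batch dimension into the channel dimension (grouped conv).
    search = search_feat.reshape(1, b * c, *search_feat.shape[2:])
    kernel = template_feat.reshape(b * c, 1, h, w)
    response = F.conv2d(search, kernel, groups=b * c)
    # Sum the per-channel correlations into a single-channel response map.
    response = response.reshape(b, c, *response.shape[2:]).sum(dim=1, keepdim=True)
    return response
```

During tracking, the peak of this response map is mapped back to image coordinates to update the estimated target position in the current frame.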
Highlights
Visual tracking, one of the fundamental tasks in computer vision, aims to estimate the location of an arbitrary object in a video sequence given only the target's position in the first frame.
To further strengthen the robustness of the extracted features in complex tracking scenarios, for instance when a cluttered background could cause the tracker to drift, a novel lightweight residual channel attention mechanism is proposed (a minimal sketch is given after these highlights).
In the experimental results and analysis, we first provide the implementation details, including the detailed backbone configuration of HA-SiamVGG, then evaluate the performance of our tracker on the OTB100 [40] and VOT2018 [41] benchmarks, and lastly carry out ablation studies to analyze the effectiveness of the proposed feature fusion strategy and channel attention mechanism.
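The sketch below shows one plausible form of a lightweight residual channel attention block, following the squeeze-and-excitation pattern with a residual connection. The reduction ratio and the placement of the residual addition are assumptions for illustration; the paper's block may differ in detail.

```python
# Hedged sketch of a lightweight residual channel attention block (SE-style).
import torch
import torch.nn as nn

class ResidualChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze: global average pooling collapses each channel to one value.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: a small bottleneck predicts per-channel weights in (0, 1).
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(self.pool(x))  # (B, C, 1, 1) channel weights
        # Residual form: re-weighted features are added back to the input, so
        # dominant channels are emphasized without suppressing the rest.
        return x + x * w
```

The residual form keeps the original features intact while the learned weights emphasize dominant channels, which matches the stated goal of widening the weight gap between channels.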
Summary
Visual tracking, as one of the fundamental tasks in computer vision, is to estimate the location of an arbitrary object in a video sequence given only the target's position in the first frame. The CNNs used by correlation-filter-based trackers were originally designed for image classification, so the extracted feature representation of the target is not fully suitable for visual tracking, and these deep trackers cannot run in real time. We take full advantage of the deeper VGG network to learn a more discriminative feature representation, equipped with a hierarchical feature fusion strategy and a lightweight residual channel attention mechanism for real-time tracking. Gao et al. [35] utilize a novel hierarchical attentional module with long short-term memory and multi-layer perceptrons to effectively facilitate visual pattern emphasis, and learn a reinforced attentional representation for accurate target discrimination and localization. In light of these insights, this paper proposes a lightweight residual channel attention module to help the backbone extract more discriminative features. Our expectation is that the network learns robust weighting coefficients during the training phase, instead of relying on manual fine-tuning on small test datasets.
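To make the hierarchical feature fusion and the learned weighting coefficients concrete, the sketch below combines a shallow (spatially detailed) and a deep (semantically strong) feature map with learnable scalar weights. The 1x1 projections, the softmax-normalized weights, and the class name are assumptions for demonstration, not the paper's exact fusion strategy.

```python
# Illustrative sketch of fusing shallow and deep feature maps with learned weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFusion(nn.Module):
    def __init__(self, shallow_channels: int, deep_channels: int, out_channels: int):
        super().__init__()
        # 1x1 convolutions project both feature maps to a common channel width.
        self.proj_shallow = nn.Conv2d(shallow_channels, out_channels, kernel_size=1)
        self.proj_deep = nn.Conv2d(deep_channels, out_channels, kernel_size=1)
        # Learnable scalars balance the two levels; they are optimized during
        # training rather than hand-tuned on a small test set.
        self.alpha = nn.Parameter(torch.ones(2))

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Resize the deep map to the shallow map's spatial resolution.
        deep = F.interpolate(deep, size=shallow.shape[2:], mode="bilinear",
                             align_corners=False)
        w = torch.softmax(self.alpha, dim=0)
        return w[0] * self.proj_shallow(shallow) + w[1] * self.proj_deep(deep)
```

Letting the fusion weights be trainable parameters reflects the expectation above that the network learns robust weighting coefficients during training instead of relying on manual fine-tuning.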