Abstract

Visual tracking is fundamental in computer vision tasks. Siamese-based trackers have shown surprising effectiveness in recent years. However, two points have been neglected. First, few of them focus on fusing image-level and semantic-level features in neural networks, which usually results in tracking failure when differentiating the target from other distractors of the same class. Second, the robustness of previous redetection schemes is limited by simply expanding the search region. To address these two issues, we propose a novel multilevel feature-weighted Siamese region proposal network tracker, which employs a feature fusion module to construct a discriminative feature embedding and a similarity-based attention module to suppress distractors in the search region. Furthermore, a color-based constraint module is presented to further suppress distractors of the same class as the target. Finally, a well-designed global redetection scheme is built to handle long-term tracking tasks. The proposed tracker achieves state-of-the-art performance on a series of popular benchmarks, including object tracking benchmark 2013 (0.699 in success score), object tracking benchmark 2015 (0.700 in success score), visual object tracking 2017 (0.470 in expected average overlap score), and visual object tracking (0.485 in expected average overlap score).
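As a rough illustration of the similarity-based attention idea, the sketch below reweights search-region features by their cosine similarity to a pooled template descriptor, so that regions unlike the template are suppressed. The mean pooling and the cosine measure are our assumptions for illustration, not the paper's exact design.

    import torch
    import torch.nn.functional as F

    def similarity_attention(template_feat, search_feat):
        # template_feat: (C, Ht, Wt) features from the template branch.
        # search_feat:   (C, Hs, Ws) features from the search branch.
        # Pool the template into one C-dim descriptor (assumption: the
        # paper may aggregate differently).
        t = F.normalize(template_feat.mean(dim=(1, 2)), dim=0)  # (C,)
        s = F.normalize(search_feat, dim=0)                     # (C, Hs, Ws)
        # Cosine similarity at every search location, kept non-negative.
        attn = torch.einsum("c,chw->hw", t, s).clamp(min=0)     # (Hs, Ws)
        # Suppress locations that look unlike the template.
        return search_feat * attn.unsqueeze(0)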

Highlights

  • Visual tracking is a technique to track a target in an image sequence, given the target’s bounding box in the first frame as the template

  • In contrast to previous works, a feature fusion module (FFM) is designed to fuse features from all levels into a unified representation, which is used to encode the similarity between the template and each sliding window in the search region

  • To make full use of the color information that is lost in deep networks, a color-based constraint module (CCM) is proposed to suppress the network’s output (a minimal sketch follows these highlights)
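A minimal sketch of how such a color constraint could work is given below, assuming coarse RGB histograms compared by histogram intersection; the paper's actual color statistic may differ. The returned weight in [0, 1] can be multiplied onto the network's response for a candidate region, suppressing same-class distractors whose colors differ from the template.

    import numpy as np

    def color_constraint(template_rgb, candidate_rgb, bins=8):
        # template_rgb, candidate_rgb: (H, W, 3) uint8 image patches.
        def hist(img):
            h, _ = np.histogramdd(img.reshape(-1, 3),
                                  bins=(bins,) * 3,
                                  range=((0, 256),) * 3)
            return h.ravel() / max(h.sum(), 1.0)

        ht, hc = hist(template_rgb), hist(candidate_rgb)
        # Histogram intersection: 1.0 for identical color
        # distributions, 0.0 for disjoint ones.
        return float(np.minimum(ht, hc).sum())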


Summary

Introduction

Visual tracking is a technique to track a target in an image sequence, given the target’s bounding box in the first frame as the template. Siamese-based trackers extract features of the template and a search frame with two convolutional branches that share weights, and generate a similarity map between the template and the search region. However, high-level features fail to represent the difference between the target and other objects of the same class, so the resulting similarity map is unreliable for distinguishing the target from distractors. Features of different levels are useful under different conditions. For this reason, we propose a multilevel feature-weighted Siamese region proposal network (MFW-SiamRPN). In contrast to previous works, a feature fusion module (FFM) is designed to fuse features from all levels into a unified representation, which is used to encode the similarity between the template and each sliding window in the search region. Extensive ablation studies verify our point of view and the effectiveness of each module.
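To make these ideas concrete, the sketch below shows (i) a hypothetical feature fusion module that projects several backbone stages to a common channel width and combines them with learned softmax weights, and (ii) the depth-wise cross-correlation commonly used in SiamRPN-style trackers, which turns the fused template features into a similarity map over the search region. The learned-weight fusion and the 256-channel width are illustrative assumptions; the paper's exact FFM may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureFusion(nn.Module):
        def __init__(self, in_channels, out_channels=256):
            super().__init__()
            # One 1x1 projection per backbone stage.
            self.proj = nn.ModuleList(
                nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
            # Learnable fusion weights, normalized by softmax.
            self.logits = nn.Parameter(torch.zeros(len(in_channels)))

        def forward(self, feats):
            size = feats[-1].shape[-2:]  # align to the deepest stage
            maps = [F.interpolate(p(f), size=size, mode="bilinear",
                                  align_corners=False)
                    for p, f in zip(self.proj, feats)]
            w = torch.softmax(self.logits, dim=0)
            return sum(wi * m for wi, m in zip(w, maps))

    def similarity_map(template_feat, search_feat):
        # Depth-wise cross-correlation: the template acts as a
        # per-channel convolution kernel over the search features.
        # template_feat: (B, C, Ht, Wt), search_feat: (B, C, Hs, Ws).
        b, c = template_feat.shape[:2]
        s = search_feat.reshape(1, b * c, *search_feat.shape[-2:])
        k = template_feat.reshape(b * c, 1, *template_feat.shape[-2:])
        out = F.conv2d(s, k, groups=b * c)
        return out.reshape(b, c, *out.shape[-2:])

In a full SiamRPN-style tracker, this similarity map would feed the region proposal heads for classification and bounding-box regression.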

Related work
Method
Experiments and evaluations
Methods
Conclusion

