Abstract

Siamese networks have drawn increasing interest in visual object tracking because they balance precision and efficiency. However, Siamese trackers use relatively shallow backbone networks, such as AlexNet, and therefore do not take full advantage of the capabilities of modern deep convolutional neural networks (CNNs). Moreover, the feature representation of the target object in a Siamese tracker is extracted from the last layer of the CNN and mainly captures semantic information, which lowers the tracker's precision and makes it prone to drift in the presence of similar distractors. In this paper, a new nonpadding residual unit (NPRU) is designed and stacked to form a 22-layer deep ResNet, referred to as ResNet22. Using ResNet22 as the backbone network, we build a deep Siamese network that greatly enhances tracking performance. Because feature maps at different levels of a CNN represent different aspects of the target object, we aggregate several deep convolutional layers to exploit ResNet22's multilevel feature maps and form hyperfeature representations of targets. The resulting deep hyper Siamese network is named DHSiam. Experimental results show that DHSiam achieves significant improvements on multiple benchmark datasets.
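
To make the idea of a padding-free residual unit concrete, the following is a minimal sketch and not the paper's NPRU, whose exact layout is not given in this text. It assumes a single-channel 2D map, two unpadded 3x3 convolutions on the main branch, and a center-cropped identity shortcut; the function names `conv2d_valid`, `center_crop`, and `nonpadding_residual_unit` are hypothetical. The point it illustrates is that, without padding, the main branch shrinks spatially, so the shortcut has to be cropped before the two can be added.

```python
# Hypothetical sketch of a padding-free residual unit on a single-channel 2D map.
# The layout (two 3x3 convs, cropped identity shortcut) is an assumption, not the paper's NPRU.

def conv2d_valid(x, k):
    """Unpadded ("valid") 2D correlation: each side shrinks by (kernel size - 1)."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def center_crop(x, out_h, out_w):
    """Keep the central out_h x out_w window so the shortcut matches the main branch."""
    top = (len(x) - out_h) // 2
    left = (len(x[0]) - out_w) // 2
    return [row[left:left + out_w] for row in x[top:top + out_h]]

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def nonpadding_residual_unit(x, k1, k2):
    """Main branch: conv -> ReLU -> conv (no padding). Shortcut: center-cropped input."""
    y = conv2d_valid(relu(conv2d_valid(x, k1)), k2)
    shortcut = center_crop(x, len(y), len(y[0]))
    return relu([[a + b for a, b in zip(ry, rs)] for ry, rs in zip(y, shortcut)])

# Usage: a 7x7 input passed through two unpadded 3x3 convs yields a 3x3 output.
identity3 = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
x = [[float(i * 7 + j) for j in range(7)] for i in range(7)]
out = nonpadding_residual_unit(x, identity3, identity3)
print(len(out), len(out[0]))  # 3 3
```

With identity kernels, the main branch and the cropped shortcut both reduce to the central 3x3 window of the input, so the output is simply twice that window. A real unit would use learned multi-channel kernels and normalization layers.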

Highlights

  • Visual object tracking is a fundamental problem of computer vision

  • The backbone network used in Siamese trackers is still the classic AlexNet [10], rather than a modern deep convolutional neural network (CNN) such as the residual network (ResNet) [11]

  • This paper addresses tracking issues by designing new residual modules and architectures that apply the powerful capabilities of deeper backbone networks in Siamese trackers


Summary

Introduction

Visual object tracking is a fundamental problem in computer vision. It aims to estimate the position of a specified target object throughout a changing video sequence, given its initial location. The backbone network used in Siamese trackers is still the classic AlexNet [10], rather than a modern deep CNN such as ResNet [11], even though such deep networks have proven to have more powerful feature extraction and generalization capabilities. Straightforwardly increasing the network depth, however, has a negative impact on tracker performance.
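
The matching step at the core of a Siamese tracker can be illustrated with a small sketch: a template taken from the initial target location is cross-correlated with a larger search region, and the peak of the response map gives the estimated position. The example below is a simplified illustration under stated assumptions and not the DHSiam implementation. Raw numbers stand in for CNN feature maps, and the helper names `cross_correlate` and `argmax2d` are hypothetical.

```python
# Hypothetical sketch: locate a template inside a search region by cross-correlation.
# Plain numbers stand in for the feature maps a real Siamese tracker would compare.

def cross_correlate(search, template):
    """Slide the template over the search map and record the dot product at each offset."""
    th, tw = len(template), len(template[0])
    out_h, out_w = len(search) - th + 1, len(search[0]) - tw + 1
    return [
        [
            sum(search[i + a][j + b] * template[a][b] for a in range(th) for b in range(tw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def argmax2d(response):
    """Return the (row, col) of the highest response; ties resolve to the first found."""
    cells = [(i, j) for i in range(len(response)) for j in range(len(response[0]))]
    return max(cells, key=lambda ij: response[ij[0]][ij[1]])

# Usage: embed a 2x2 template at row 2, column 1 of an otherwise empty 5x5 search map.
template = [[1, 2], [3, 4]]
search = [[0] * 5 for _ in range(5)]
for a in range(2):
    for b in range(2):
        search[2 + a][1 + b] = template[a][b]

response = cross_correlate(search, template)
print(argmax2d(response))  # (2, 1)
```

In an actual tracker, both inputs would first pass through the shared backbone network, and the peak location would be mapped back to image coordinates for each new frame.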

