Abstract
Trackers based on fully-convolutional Siamese networks treat tracking as the problem of learning a similarity function. By using shallow networks and offline training, Siamese trackers achieve high tracking speed and perform well in simple scenes. However, because their features carry limited semantic information and their template is fixed, Siamese trackers still lag behind state-of-the-art methods in complex scenes and under challenges such as occlusion and deformation. In this paper, we propose a Siamese tracking algorithm with deep features and robust feature fusion (SiamDF). An improved ResNet-18 network replaces the traditional shallow network and extracts deep features with more semantic information. To eliminate the negative effect of padding and make better use of the deep network, the proposed algorithm adopts a spatial-aware sampling strategy to overcome the strict translation-invariance restriction. Meanwhile, multi-layer feature fusion yields a high-quality final response map, which significantly reduces the impact of distractors in complex scenes. In addition, an adaptive feature information fusion is used to update the template, so that the algorithm can adapt to changes in target appearance. On the OTB100 dataset, the precision and overlap success reach 0.852 and 0.658, respectively, and the EAO on the VOT2016 dataset reaches 0.336. These results demonstrate that our algorithm effectively improves tracking performance and performs favorably in both robustness and accuracy.
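To make the pipeline in the abstract concrete, the following is a minimal sketch of three of its ingredients: a cross-correlation response map, a fusion of per-layer response maps, and a template update. It is an illustration under stated assumptions and not the SiamDF implementation: the weighted-average fusion, the linear-interpolation update, and the names `cross_correlate`, `fuse_responses`, `update_template`, `weights`, and `rate` are all hypothetical, and toy 2-D lists stand in for real feature maps.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (2-D lists) and return the response map."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    return [
        [
            sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th)
                for dj in range(tw)
            )
            for j in range(sw - tw + 1)
        ]
        for i in range(sh - th + 1)
    ]


def fuse_responses(responses, weights):
    """Weighted average of same-sized response maps (assumed fusion rule)."""
    total = sum(weights)
    rows, cols = len(responses[0]), len(responses[0][0])
    return [
        [
            sum(w * r[i][j] for w, r in zip(weights, responses)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def update_template(template, new_features, rate=0.1):
    """Blend new features into the template by linear interpolation (assumed rule)."""
    return [
        [(1 - rate) * t + rate * n for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(template, new_features)
    ]


def argmax_2d(response):
    """Return the (row, col) position of the peak of a response map."""
    positions = (
        (i, j) for i in range(len(response)) for j in range(len(response[0]))
    )
    return max(positions, key=lambda p: response[p[0]][p[1]])


# Toy data: one "shallow" and one "deep" feature layer for a 4x4 search region.
template_shallow = [[1.0, 0.0], [0.0, 1.0]]
template_deep = [[1.0, 1.0], [1.0, 1.0]]
search_shallow = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
search_deep = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

fused = fuse_responses(
    [
        cross_correlate(search_shallow, template_shallow),
        cross_correlate(search_deep, template_deep),
    ],
    weights=[0.4, 0.6],
)
peak = argmax_2d(fused)  # estimated target position in the response map
```

In this toy case both layers peak at the same position, so the fused map peaks there as well. In a real tracker the layers can disagree, which is where the choice of fusion weights matters.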
Highlights
Visual tracking is an important direction in computer vision and has long attracted considerable research attention
To eliminate the negative effect of padding and make better use of the deep network, the proposed algorithm adopts a spatial-aware sampling strategy to overcome the strict translation-invariance restriction, and multi-layer feature fusion yields a high-quality final response map
The tracker significantly reduces the impact of distractors in complex scenes
Summary
Visual tracking is an important direction in computer vision and has long attracted considerable research attention. Although the SiamRPN++ tracker [19] successfully introduces a deep architecture into Siamese tracking, it does not fuse the feature information of the multi-layer network effectively. Inspired by these studies, we propose a novel multi-layer feature fusion method that yields a high-quality response map and significantly reduces the impact of distractors in complex scenes. To eliminate the negative effect of padding and make better use of the deep network, the proposed algorithm adopts a spatial-aware sampling strategy to overcome the strict translation-invariance restriction. We also adopt an adaptive feature information fusion to update the otherwise constant template of the Siamese network, which makes the tracker more adaptive to changes in target appearance.
The training loss aggregates a per-position term l(h[m], r[m]) over all positions m ∈ M, where M denotes the response map after cross-correlation, and h[m] ∈ {+1, −1} and r[m] are the label and the score at position m. Here l is the logistic loss, whose standard form is l(h, r) = log(1 + exp(−h·r)).
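The logistic loss over a response map described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names `logistic_loss` and `response_map_loss` are hypothetical, and averaging over positions (as in the SiamFC formulation) is an assumption, since the text here does not show the normalization.

```python
import math


def logistic_loss(label, score):
    """Standard logistic loss log(1 + exp(-label * score)), label in {+1, -1}."""
    z = -label * score
    # Numerically stable evaluation of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def response_map_loss(labels, scores):
    """Mean per-position logistic loss over a response map.

    Averaging over positions is an assumption made for this sketch.
    """
    pairs = [
        (h, r)
        for h_row, r_row in zip(labels, scores)
        for h, r in zip(h_row, r_row)
    ]
    return sum(logistic_loss(h, r) for h, r in pairs) / len(pairs)


# Toy 2x2 response map: the top-left position is the positive (target) label.
labels = [[+1, -1], [-1, -1]]
scores = [[2.0, -1.0], [-3.0, 0.5]]
loss = response_map_loss(labels, scores)
```

A confident correct score (large positive r for h = +1, or large negative r for h = −1) contributes a loss close to zero, while a confident wrong score contributes a loss that grows roughly linearly with |r|.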