Abstract
Recent advances in object tracking based on deep Siamese networks have shifted attention away from correlation filters. However, a Siamese network alone does not reach the accuracy of state-of-the-art correlation filter-based trackers, whereas correlation filter-based trackers alone suffer from a frame-update problem. In this paper, we present a Siamese network with spatially semantic correlation features (SNS-CF) for accurate, robust object tracking. To deal with the various types of features spread across many regions of the input frame, the proposed SNS-CF consists of (1) a Siamese feature extractor, (2) a spatially semantic feature extractor, and (3) an adaptive correlation filter. To the best of the authors' knowledge, the proposed SNS-CF is the first attempt to fuse a Siamese network and a correlation filter to provide high-frame-rate, real-time visual tracking with performance favorable compared with state-of-the-art methods on multiple benchmarks.
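The abstract names three cooperating modules. The PyTorch sketch below is only an illustration of how such a pipeline could be wired together, not the authors' implementation: the layer shapes, the attention-style re-weighting used to stand in for the spatially semantic feature extractor, and the single-channel cross-correlation are all assumptions made for the example.

```python
# Illustrative sketch of an SNS-CF-style pipeline (assumed structure, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseFeatureExtractor(nn.Module):
    """Shared-weight backbone applied to both the template and the search frame."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3), nn.ReLU(),
        )

    def forward(self, x):
        return self.backbone(x)

class SpatiallySemanticFeatureExtractor(nn.Module):
    """Re-weights backbone features so responses from different spatial regions are
    emphasized before correlation (a simple attention map is assumed here)."""
    def __init__(self, channels=256):
        super().__init__()
        self.attention = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat):
        return feat * self.attention(feat)

def correlation_response(template_feat, search_feat):
    """Cross-correlate template features over the search features to get a response map."""
    return F.conv2d(search_feat, template_feat)

# Usage sketch: the target is located at the peak of the response map in the search frame.
extractor = SiameseFeatureExtractor()
ssf = SpatiallySemanticFeatureExtractor()
template = torch.randn(1, 3, 127, 127)   # exemplar crop around the target
search = torch.randn(1, 3, 255, 255)     # larger search region in the current frame
z, x = ssf(extractor(template)), ssf(extractor(search))
response = correlation_response(z, x)
peak = (response == response.max()).nonzero()
```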
Highlights
Visual object tracking aims at estimating the position of an arbitrary target in a video sequence by establishing a correspondence between similar pixels of different frames [1,2,3].
Starting from correlation filter-based trackers, we quantitatively evaluated the proposed algorithm against 9 state-of-the-art trackers [3,12,27,28,29,30,31,32,33], considering the distance precision rate (DP) at 20 pixels, the overlap success rate (OS) at 0.5, the center location error (CLE), and the tracking speed on the 100 sequences of the OTB-2015 benchmark [10] (a sketch of these metrics follows the highlights).
Fusion-based results: we present results from combining a state-of-the-art correlation filter-based tracker [3] and a state-of-the-art Siamese network-based tracker [2], both by direct combination, that is, with no modification, and with our proposed algorithm, which includes the extraction of spatially semantic correlation features (SSF) and the learning of adaptive correlation filters (ACF).
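For reference, the evaluation criteria listed in the highlights (DP at 20 pixels, OS at 0.5, and CLE) can be computed from per-frame bounding boxes as in the minimal NumPy sketch below. The (x, y, w, h) box convention and the helper names are assumptions for this example and are not taken from the paper; the thresholds follow the OTB protocol [10].

```python
# Minimal sketch of OTB-style tracking metrics, assuming boxes are (x, y, w, h) arrays
# with one row per frame for both the tracker output and the ground truth.
import numpy as np

def center_location_error(pred, gt):
    """Per-frame Euclidean distance between predicted and ground-truth box centers."""
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)

def overlap_ratio(pred, gt):
    """Per-frame intersection-over-union between predicted and ground-truth boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def otb_metrics(pred, gt, dp_thresh=20.0, os_thresh=0.5):
    """DP at 20 px, OS at 0.5, and mean CLE over one sequence."""
    cle = center_location_error(pred, gt)
    iou = overlap_ratio(pred, gt)
    return {
        "DP@20px": float(np.mean(cle <= dp_thresh)),  # distance precision rate
        "OS@0.5": float(np.mean(iou >= os_thresh)),   # overlap success rate
        "mean_CLE": float(np.mean(cle)),              # mean center location error (pixels)
    }
```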
Summary
Visual object tracking aims at estimating the position of an arbitrary target in a video sequence by establishing a correspondence between similar pixels of different frames [1,2,3]. Failure cases: in some challenging scenarios, our algorithm failed completely to locate the position of the target. We suspect this is due to intense background clutter, the appearance of many foreground objects that resemble the target but are not the target, and severe out-of-view motion. Severe out-of-view cases could be better addressed if our algorithm were equipped with a re-detection module, which will be our future research. The observed failure cases involve multiple foreground objects similar to the target, severe out-of-view motion, and sudden background clutter, respectively.