Abstract

Most existing Siamese-type trackers adopt pre-defined anchor boxes or anchor-free schemes to estimate the target bounding box accurately. Unfortunately, they suffer from complicated hand-designed components and tedious post-processing, making it difficult to adjust parameters for specific scenes in real applications. To alleviate this issue, we propose a new scheme that formulates visual tracking as a direct set prediction problem. The main component is a transformer attached to the Siamese-type feature extraction networks; we therefore refer to our framework as Siamese Network with Transformers (SiamTFR). Given a fixed small set of learned object queries, we enforce a unique final set of predictions via bipartite matching, significantly reducing the hyper-parameters associated with candidate boxes. Because the predictions are unique, the framework largely removes the heavy burden of hyper-parameter search in tracking post-processing. Extensive experiments on visual tracking benchmarks, including GOT-10K, demonstrate that SiamTFR achieves competitive performance and runs at 50 FPS. Specifically, SiamTFR outperforms the leading anchor-based tracker SiamRPN++ on the GOT-10K benchmark, confirming its effectiveness and efficiency. Furthermore, SiamTFR is deployed on an embedded device, where it runs at 30 FPS, or 54 FPS with TensorRT acceleration, meeting real-time requirements. In addition, we design a complete tracking system demo that works on real roads, narrowing the gap between academic models and industrial deployment.
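The design described above follows the DETR-style set-prediction recipe: a weight-shared (Siamese) backbone embeds the template and search crops, a transformer decodes a fixed set of learned object queries against those features, and bipartite matching pairs queries with ground truth so each prediction is unique and no anchors or NMS are needed. Below is a minimal sketch of that pipeline, assuming PyTorch and SciPy; the class name SiamTFRSketch, the layer sizes, crop resolutions, and the plain L1 matching cost are illustrative assumptions, not the paper's actual implementation (which the abstract does not detail).

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class SiamTFRSketch(nn.Module):
    """Siamese backbone + transformer with learned object queries (sketch)."""
    def __init__(self, dim=256, num_queries=16, nhead=8, num_layers=4):
        super().__init__()
        # Weight-shared (Siamese) conv backbone applied to both crops.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(
            d_model=dim, nhead=nhead, num_encoder_layers=num_layers,
            num_decoder_layers=num_layers, batch_first=True)
        self.queries = nn.Embedding(num_queries, dim)  # fixed learned query set
        self.box_head = nn.Linear(dim, 4)   # normalized (cx, cy, w, h)
        self.cls_head = nn.Linear(dim, 2)   # target / background logits

    def forward(self, template, search):
        # Flatten both branches' feature maps into one token sequence.
        z = self.backbone(template).flatten(2).transpose(1, 2)
        x = self.backbone(search).flatten(2).transpose(1, 2)
        memory = torch.cat([z, x], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(search.size(0), -1, -1)
        h = self.transformer(memory, q)     # (batch, num_queries, dim)
        return self.box_head(h).sigmoid(), self.cls_head(h)

def match_queries(pred_boxes, gt_boxes):
    """One-to-one bipartite (Hungarian) matching between query predictions
    and ground-truth boxes, with a plain L1 box cost for brevity."""
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_gt)
    q_idx, g_idx = linear_sum_assignment(cost.detach().numpy())
    return q_idx, g_idx

if __name__ == "__main__":
    model = SiamTFRSketch()
    template = torch.randn(1, 3, 127, 127)  # exemplar crop
    search = torch.randn(1, 3, 255, 255)    # search-region crop
    boxes, logits = model(template, search)
    q_idx, g_idx = match_queries(boxes[0], torch.rand(1, 4))
    print(boxes.shape, logits.shape, q_idx, g_idx)
```

In single-object tracking there is one ground-truth box per frame, so the Hungarian step reduces to picking the minimum-cost query; it is kept general here for clarity. During training, the matched query would receive box and classification losses while the remaining queries are supervised as background; at inference, the highest-scoring query is taken directly as the result, with no anchor tuning or NMS post-processing.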
