Visual tracking is a fundamental task in computer vision: it extracts context descriptions and features of the target from the first frame, then locates the target in subsequent frames while updating the target appearance model and the tracking model accordingly. Existing deep learning based trackers suffer from three deficiencies: the precision-speed dilemma, inadequate use of context, and the accumulation of tracking noise. To achieve real-time and accurate tracking with deep learning modules under the tracking-by-detection framework, this paper introduces a discriminative target predictor based on temporal-scene attention context enhancement and a candidate matching mechanism (ACDP). The contributions of ACDP include: 1) a vision transformer based temporal context enhancer, which formulates a temporal-context-enhanced feature extractor via the multi-head attention mechanism; 2) a scene context enhancer based on a target state propagation scheme, which constructs a target state matrix and the corresponding state forward propagation techniques; 3) a joint prediction method that makes full use of the enhanced context through a candidate selection and matching mechanism based on a dust-bin conception network; 4) theoretical proofs for the vision transformer acceleration scheme and for the error bounds under successive tracking failures. In addition, extensive experiments on the OTB100, UAV123, NFS, AVisT and VOT2018 benchmarks demonstrate the favorable performance of ACDP and indicate that: 1) the proposed modules enable ACDP to achieve competitive tracking results, even on difficult sequences and scenarios, against 21 other state-of-the-art trackers; 2) ACDP runs at an average speed of 35 FPS across the five benchmarks.
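To make the first contribution concrete, the sketch below illustrates one plausible reading of a multi-head-attention temporal context enhancer: tokens from past frames act as keys/values and the current frame's tokens as queries. This is a minimal illustration, not the authors' implementation; the module name, token shapes, `embed_dim`, and the residual/normalization layout are all assumptions.

```python
# Minimal sketch (assumed design, not the paper's code) of a temporal
# context enhancer built on multi-head attention.
import torch
import torch.nn as nn

class TemporalContextEnhancer(nn.Module):
    """Hypothetical module: enhances current-frame features by attending
    over a memory of past-frame features (temporal context)."""
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, current: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # current: (B, N, C) feature tokens of the current frame (queries)
        # memory:  (B, T*N, C) feature tokens gathered from T past frames
        enhanced, _ = self.attn(query=current, key=memory, value=memory)
        return self.norm(current + enhanced)  # residual connection + norm

# Usage: fuse tokens from two (assumed) past frames into the current frame.
feats_now = torch.randn(1, 64, 256)    # current-frame tokens
feats_past = torch.randn(1, 128, 256)  # memory of past-frame tokens
out = TemporalContextEnhancer()(feats_now, feats_past)  # -> (1, 64, 256)
```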