Abstract

Visual object tracking, one of the main topics in computer vision, aims to locate a target object in every frame of a video sequence. In particular, Siamese network architectures have been adopted widely for visual object tracking due to their correlation-based nature. However, the features encoded from the target template and the search image in the Siamese branches still suffer from ambiguities caused by complicated real-world conditions, e.g., occlusions and rotations. This paper proposes the Siamese feedback network for robust object tracking. The key idea of the proposed method is to encode target-relevant features accurately via the feedback block, which is defined as a combination of attention and refinement modules. Specifically, interdependent features are extracted through self- and cross-attention operations. Subsequently, the re-calibrated features are refined in both a spatial and a channel-wise manner. The refined features are then fed back to the input of the feedback block via a feedback loop. This is desirable because the high-level semantic information guides the feedback block to learn more meaningful properties of the target object and its surroundings. The experimental results show that the proposed method outperforms state-of-the-art Siamese-based methods with gains of 0.72% and 1.69% in expected average overlap on the VOT2016 and VOT2018 datasets, respectively. Overall, the proposed method is effective for visual object tracking, even in complicated real-world scenarios.
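To make the abstract's pipeline concrete, the sketch below shows one plausible way the feedback block could be wired in PyTorch: self-attention on the search features, cross-attention against the template features, spatial and channel-wise refinement, and a loop that feeds the refined output back as input. This is a minimal illustration only; the module names, gating choices, tensor shapes, and number of feedback steps are assumptions, since the abstract does not specify the paper's exact architecture.

```python
# Hypothetical sketch of the feedback block (attention + refinement + feedback
# loop) described in the abstract; all design details here are assumptions.
import torch
import torch.nn as nn

class Refinement(nn.Module):
    """Refines features channel-wise (squeeze-and-excitation style gate) and
    spatially (single-channel attention map). Both gates are assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)     # channel-wise re-weighting
        return x * self.spatial_gate(x)  # spatial re-weighting

class FeedbackBlock(nn.Module):
    """Self-/cross-attention followed by refinement; the refined output is
    fed back as the block's input for `steps` iterations."""
    def __init__(self, channels, heads=4, steps=2):
        super().__init__()
        self.steps = steps
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.refine = Refinement(channels)

    def forward(self, z, x):
        # z: template features (B, C, Hz, Wz); x: search features (B, C, Hx, Wx)
        b, c, h, w = x.shape
        for _ in range(self.steps):
            xs = x.flatten(2).transpose(1, 2)      # (B, Hx*Wx, C)
            zs = z.flatten(2).transpose(1, 2)      # (B, Hz*Wz, C)
            xs, _ = self.self_attn(xs, xs, xs)     # self-attention on search
            xs, _ = self.cross_attn(xs, zs, zs)    # cross-attention with template
            refined = self.refine(xs.transpose(1, 2).reshape(b, c, h, w))
            x = refined + x                        # feedback: output becomes input
        return x

# Usage with made-up feature shapes: template 8x8, search 16x16, 256 channels.
out = FeedbackBlock(256)(torch.randn(1, 256, 8, 8), torch.randn(1, 256, 16, 16))
```

The loop reflects the abstract's motivation: after one pass, the features carry higher-level semantic information about the target and its surroundings, so re-running the same block on them can sharpen the target-relevant encoding.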
