Recently, Siamese-based trackers have achieved significant success. However, these trackers are limited by the difficulty of learning a feature representation that remains consistent with the object. To address this challenge, this paper proposes a novel Siamese implicit region proposal network with compound attention for visual tracking. First, an implicit region proposal (IRP) module is designed around a novel pixel-wise correlation method. This module aggregates feature information from different regions, which play a role similar to the pre-defined anchor boxes in a Region Proposal Network. Adaptive feature receptive fields can then be obtained by linearly fusing the features from these regions. Second, a compound attention module, consisting of channel attention and non-local attention, is introduced to help the IRP module better perceive the scale and shape of the object. The channel attention mines discriminative information about the object to handle background clutter in the template, while the non-local attention is trained to aggregate contextual information and learn the semantic range of the object. Experimental results demonstrate that the proposed tracker achieves state-of-the-art performance on six challenging benchmarks: VOT-2018, VOT-2019, OTB-100, GOT-10k, LaSOT, and TrackingNet. The tracker also runs in real time, at an average speed of 72 FPS.
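To make the idea of pixel-wise correlation concrete, the following is a minimal sketch of the generic operation as it is commonly described in the tracking literature, not the novel variant proposed in this paper, whose details the abstract does not give. It assumes that each spatial position of the template feature map acts as a 1x1 kernel over the search feature map, so that every output channel is the response of one template region. The function name `pixelwise_correlation` and the array layout `(channels, height, width)` are illustrative assumptions.

```python
import numpy as np


def pixelwise_correlation(template: np.ndarray, search: np.ndarray) -> np.ndarray:
    """Generic pixel-wise correlation (assumed form, not the paper's exact method).

    template: (C, Hz, Wz) feature map of the target template.
    search:   (C, Hx, Wx) feature map of the search region.
    Returns:  (Hz * Wz, Hx, Wx) response maps, one per template position.
    """
    c, hz, wz = template.shape
    c_search, hx, wx = search.shape
    if c != c_search:
        raise ValueError("template and search must have the same channel count")

    # Each column of `kernels` is the C-dimensional feature of one template pixel.
    kernels = template.reshape(c, hz * wz)   # (C, Hz*Wz)
    pixels = search.reshape(c, hx * wx)      # (C, Hx*Wx)

    # Dot product between every template pixel and every search pixel.
    response = kernels.T @ pixels            # (Hz*Wz, Hx*Wx)
    return response.reshape(hz * wz, hx, wx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal((4, 2, 3))   # hypothetical template features
    x = rng.standard_normal((4, 5, 6))   # hypothetical search features
    print(pixelwise_correlation(z, x).shape)
```

Under this assumed formulation, a linear fusion of the output channels (for example, a learned 1x1 convolution over the `Hz * Wz` response maps) would combine the responses of different template regions, which is one way to read the adaptive receptive fields mentioned above.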