Abstract

Visual attention has recently achieved great success and seen wide application in deep neural networks. Trackers based on Siamese networks achieve a good accuracy–efficiency trade-off in visual tracking. However, their training time grows as networks become deeper and training sets become larger, and Siamese trackers still struggle to predict the target location under fast motion, full occlusion, camera motion, and similar-object distractors. To address these difficulties, we develop an end-to-end Siamese attention network for visual tracking. Our approach introduces an attention branch into the region proposal network, which already contains a classification branch and a regression branch. Foreground–background classification is performed by combining the scores of the classification branch and the attention branch, and the regression branch then predicts the bounding boxes of the candidate regions based on the classification results. The proposed tracker achieves results comparable to state-of-the-art trackers on six tracking benchmarks. In particular, it attains an AUC score of 0.503 on LaSOT while running at 40 frames per second (FPS).
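The abstract describes fusing the classification branch and the attention branch scores to decide foreground versus background. A minimal sketch of one plausible fusion rule is shown below, assuming a simple weighted sum over per-location foreground score maps; the function name `fuse_scores`, the weight `alpha`, and the toy score maps are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_scores(cls_scores, attn_scores, alpha=0.5):
    # Hypothetical fusion rule: weighted sum of the classification-branch
    # and attention-branch foreground scores (the paper's precise
    # combination is not specified in the abstract).
    return alpha * cls_scores + (1.0 - alpha) * attn_scores

# Toy example: foreground scores on a 2x2 spatial grid of anchors.
cls = np.array([[0.9, 0.2], [0.4, 0.7]])
attn = np.array([[0.8, 0.1], [0.6, 0.3]])
fused = fuse_scores(cls, attn, alpha=0.6)

# The location with the highest fused score is treated as foreground;
# the regression branch would then refine a bounding box there.
best = np.unravel_index(np.argmax(fused), fused.shape)
```

In this sketch the attention scores simply re-weight the classifier's confidence, which is one common way an auxiliary branch can suppress responses on background clutter or similar-looking distractors.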
