Abstract

Template-based learning, particularly with Siamese networks, has recently become popular because it balances accuracy and speed. However, preserving tracker robustness in challenging scenarios while maintaining real-time speed remains a primary concern in visual object tracking. Siamese trackers struggle to handle continual target appearance changes because they learn limited discriminative ability between target and background information. This paper presents stacked channel-spatial attention within Siamese networks to improve tracker robustness without sacrificing fast tracking speed. The proposed channel attention strengthens target-specific channels by increasing their weights while down-weighting irrelevant channels. Spatial attention focuses on the most informative regions of the target feature map. We integrate the proposed channel and spatial attention modules to enhance tracking performance through end-to-end learning. The proposed tracking framework learns what and where to highlight in the target information for efficient tracking. Experimental results on the widely used OTB100, OTB50, VOT2016, VOT2017/18, TC-128, and UAV123 benchmarks verify that the proposed tracker achieves outstanding performance compared with state-of-the-art trackers.
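As a rough illustration of how the attended features would be used in a Siamese pipeline, the sketch below refines both the exemplar (template) and search-region features with the same attention block and then cross-correlates them to produce a similarity response map. This is a minimal sketch, not the authors' implementation: the feature shapes, the identity placeholder for the attention block, and the naive SiamFC-style cross-correlation are all assumptions.

```python
import torch
import torch.nn.functional as F


def siamese_response(template_feat, search_feat, attention_block):
    """Apply a (shared) attention block to exemplar and search features,
    then cross-correlate them to obtain a similarity response map.

    Shapes are illustrative: template (1, C, h, w), search (1, C, H, W).
    """
    z = attention_block(template_feat)  # refined "what/where" exemplar features
    x = attention_block(search_feat)    # refined search-region features
    # Naive SiamFC-style cross-correlation: the exemplar acts as a conv kernel.
    return F.conv2d(x, z)               # (1, 1, H - h + 1, W - w + 1)


# Quick check with random features and an identity placeholder for attention.
z_feat = torch.randn(1, 256, 6, 6)
x_feat = torch.randn(1, 256, 22, 22)
score = siamese_response(z_feat, x_feat, torch.nn.Identity())
print(score.shape)  # torch.Size([1, 1, 17, 17])
```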

Highlights

  • Visual object tracking is a fundamental and challenging task for a wide range of computer vision applications, including intelligent surveillance [1], autonomous vehicles [2], game analysis [3], and human-computer interface [4]

  • We were inspired by human visual perception, which does not require concentrating on the whole scene but rather focuses on a specific object, perceiving its informative parts to understand the appropriate visual pattern [52]

  • The global max-pooling operation focuses on distinctive, finer object features, whereas global average pooling provides overall knowledge of the feature map for channel attention. After computing both pooling operations, we pass each pooled descriptor through a multilayer perceptron (MLP) with a rectified linear unit (ReLU) layer that learns the non-linearity between two fully-connected layers with 128 and 512 nodes, respectively (see the sketch after this list)

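Based on the description above, a minimal PyTorch sketch of the stacked channel-spatial attention module could look like the following. It is an illustration under assumptions: the 512-channel input is inferred from the 512-node fully-connected layer, the MLP is shared between the max- and average-pooled descriptors, the spatial branch follows the common max/average-plus-convolution design, and the channel-then-spatial stacking order is assumed.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention ("what"): max- and average-pooled descriptors pass
    through an MLP (ReLU between two FC layers with 128 and 512 nodes,
    assumed shared) and are fused into per-channel weights."""

    def __init__(self, channels=512, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),   # 512 -> 128
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),   # 128 -> 512
        )

    def forward(self, x):                               # x: (N, C, H, W)
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))               # global max pooling
        w = torch.sigmoid(avg + mx).view(n, c, 1, 1)    # per-channel weights
        return x * w


class SpatialAttention(nn.Module):
    """Spatial attention ("where"), assumed design: channel-wise average and
    max maps are concatenated and convolved into one spatial weight map."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # (N, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)                # (N, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w


class StackedChannelSpatialAttention(nn.Module):
    """Channel attention first, then spatial attention (assumed order)."""

    def __init__(self, channels=512):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```

Applied to the backbone features of both Siamese branches before cross-correlation, a module like this would realize the "what and where" refinement described in the abstract.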

Summary

INTRODUCTION

Visual object tracking is a fundamental and challenging task for a wide range of computer vision applications, including intelligent surveillance [1], autonomous vehicles [2], game analysis [3], and human-computer interface [4]. Several Siamese-based trackers have been proposed to address these challenges. For example, DSiam [7] learns background suppression and appearance variations from earlier frames using a fast transformation learning model, whereas DCFNet [6] integrates a discriminant correlation filter (DCF) within a lightweight architecture and drives back-propagation to adjust the DCF layer using the probability heat map of the target location. Models and results are available at https://github.com/maklachur/SCSAtt.

RELATED WORK
SIAMESE NETWORK FOR FEATURE LEARNING
STACKED CHANNEL-SPATIAL ATTENTION
IMPLEMENTATION DETAILS
EXPERIMENTS
Findings
CONCLUSION