Abstract

We propose a soft-mask-based low-level feature fusion technique to improve visual object tracking, and strengthen it with channel and spatial attention mechanisms. The approach is integrated within a Siamese framework to demonstrate its effectiveness for visual object tracking. The soft mask gives more weight to target regions than to other regions, which enables effective target feature representation and increases discriminative power. The low-level feature fusion improves the tracker's robustness against distractors. Channel attention identifies the more discriminative channels for better target representation. Spatial attention complements the soft mask to better localize target objects in challenging tracking scenarios. We evaluated the proposed approach on five publicly available benchmark datasets and performed extensive comparisons with 39 state-of-the-art tracking algorithms. The proposed tracker demonstrates excellent performance compared to existing state-of-the-art trackers.

Highlights

  • Visual Object Tracking (VOT) is a promising, attractive, and challenging field in computer vision with a wide range of real-world applications including robotics [1], autonomous vehicles [2], video understanding [3], surveillance, and security [4]

  • VOT remains an active research area owing to challenges such as occlusion, various types of noise, appearance and scale variations of the target, environmental changes, motion blur, illumination variations, and background clutter

  • We propose a soft mask feature fusion mechanism to highlight the full target region compared to the background region

Summary

Introduction

Visual Object Tracking (VOT) is a promising, attractive, and challenging field in computer vision with a wide range of real-world applications including robotics [1], autonomous vehicles [2], video understanding [3], surveillance, and security [4]. Siamese trackers including [30,31,32,33,34] are computationally efficient but exhibit performance degradation under many scenarios. These trackers learn similarity by training offline on large benchmarks and do not learn the most discriminative features for a specific target, which reduces tracking performance. We propose to integrate two different attention mechanisms into the Siamese tracking framework to emphasize discriminative channels and important spatial features in the latent space. The resulting Soft-mask with Channel and Spatial attentional Siamese (SCS-Siam) tracking framework learns both effective and discriminative features. Soft mask feature fusion with dual attention is integrated within the Siamese tracking framework through a skip connection to enhance the tracker's ability to discriminate the target from the background.
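
To make the roles of the soft mask, the two attentions, and the skip connection concrete, the following is a minimal NumPy sketch. It is not the SCS-Siam architecture: the function names, the fixed `background_weight`, the rectangular mask, and the parameter-free sigmoid gates (used here in place of learned attention layers) are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def soft_mask(height, width, box, background_weight=0.3):
    """Weight map: 1.0 inside the target box, a smaller weight elsewhere.

    `box` is (top, left, bottom, right); `background_weight` is a
    hypothetical value, not one taken from the paper.
    """
    top, left, bottom, right = box
    mask = np.full((height, width), background_weight, dtype=np.float64)
    mask[top:bottom, left:right] = 1.0
    return mask


def channel_attention(features):
    """Scale each channel by a sigmoid gate of its global average (assumed form)."""
    pooled = features.mean(axis=(1, 2))        # shape (C,)
    return features * sigmoid(pooled)[:, None, None]


def spatial_attention(features):
    """Scale each location by a sigmoid gate of its cross-channel mean (assumed form)."""
    pooled = features.mean(axis=0)             # shape (H, W)
    return features * sigmoid(pooled)[None, :, :]


def fuse(features, mask):
    """Soft-masked features refined by both attentions, added back via a skip connection."""
    masked = features * mask[None, :, :]       # down-weight background locations
    refined = spatial_attention(channel_attention(masked))
    return features + refined                  # skip connection keeps the original features


# Usage: a (channels, height, width) feature map with a target box in the centre.
features = np.ones((2, 4, 4))
mask = soft_mask(4, 4, box=(1, 1, 3, 3))
fused = fuse(features, mask)
```

With a uniform input, the fused response inside the box ends up larger than outside it, which is the intended effect of giving the target region more weight than the background.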

Related Work
Deep Feature-Based Trackers
Siamese Network-Based Trackers
Attention Mechanism-Based Trackers
Proposed SCS-Siam Network
Baseline SiameseFC Tracker
Soft-Mask Feature Fusion
Channel Attention Module
Spatial Attention Module
Network Training
Implementation Details
Datasets and Evaluation Metrics
Experiments on OTB2015
Experiments on TC128 and UAV123
Experiments on VOT2016 and VOT2017
Ablation Study
Findings
Conclusions