Abstract

Siamese-network-based visual tracking has drawn considerable attention recently due to its balanced accuracy and speed. Methods of this type typically train a relatively shallow twin network offline and locate the object online by measuring similarity with a cross-correlation operation between the feature maps that the last convolutional layer produces for the target and search regions. Nevertheless, a single feature map extracted from the last layer of a shallow network is insufficient to describe the target appearance and is sensitive to distractors, which can mislead the similarity response map and cause the tracker to drift. To enhance tracking accuracy and robustness while maintaining real-time speed, this paper makes three improvements to the above tracking paradigm: a redesigned backbone network, fusion of hierarchical features, and a channel attention mechanism. First, we introduce a modified, deeper VGG16 backbone network that extracts more powerful features for distinguishing the target from distractors. Second, we fuse features extracted from deep and shallow layers to exploit both the semantic and the spatial information of the target. Third, we incorporate a novel lightweight residual channel attention mechanism into the backbone network, which widens the weight gap between channels and helps the network pay more attention to dominant features. Extensive experiments on the OTB100 and VOT2018 benchmarks demonstrate that our tracker outperforms several state-of-the-art methods in accuracy and efficiency in real-time scenarios.
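To make the similarity measurement concrete, the following is a minimal PyTorch sketch of the cross-correlation step used by Siamese trackers to turn the target and search-region feature maps into a response map; the tensor shapes and channel count are illustrative assumptions, not the configuration used in the paper.

    import torch
    import torch.nn.functional as F

    def cross_correlation(template_feat, search_feat):
        # template_feat: (1, C, Hz, Wz) features of the target exemplar
        # search_feat:   (1, C, Hx, Wx) features of the search region
        # Treating the template as a convolution kernel slides it over the
        # search region and yields a (1, 1, Hx-Hz+1, Wx-Wz+1) response map
        # whose peak indicates the most likely target location.
        return F.conv2d(search_feat, template_feat)

    # Toy example: 256-channel features, 6x6 template, 22x22 search region.
    z = torch.randn(1, 256, 6, 6)
    x = torch.randn(1, 256, 22, 22)
    print(cross_correlation(z, x).shape)   # torch.Size([1, 1, 17, 17])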

Highlights

  • Visual tracking, one of the fundamental tasks in computer vision, aims to estimate the location of an arbitrary object in a video sequence, given only the target's position in the first frame

  • A novel lightweight residual channel attention mechanism is proposed to further strengthen the robustness of the extracted features in complex tracking scenarios, for instance heavily cluttered backgrounds that can cause the tracker to drift (a minimal sketch follows this list)

  • In the experimental results and analysis, we first provide implementation details, including the backbone configuration of HA-SiamVGG, then evaluate our tracker on the OTB100 [40] and VOT2018 [41] benchmarks, and finally carry out ablation studies to analyze the effectiveness of the proposed feature fusion strategy and channel attention mechanism
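
The residual channel attention mechanism mentioned in the highlights can be sketched as a squeeze-and-excitation-style block with a residual connection; the reduction ratio and the exact placement inside the backbone below are assumptions made for illustration rather than the paper's exact design.

    import torch
    import torch.nn as nn

    class ResidualChannelAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling per channel
            self.fc = nn.Sequential(                 # lightweight excitation branch
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            w = self.fc(self.pool(x))                # per-channel weights in (0, 1)
            # Residual form: the re-weighted features are added back to the input,
            # widening the gap between dominant and less informative channels
            # without suppressing the original signal.
            return x + x * w

    feat = torch.randn(1, 512, 22, 22)
    print(ResidualChannelAttention(512)(feat).shape)   # torch.Size([1, 512, 22, 22])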

Summary

INTRODUCTION

Visual tracking, one of the fundamental tasks in computer vision, is to estimate the location of an arbitrary object in a video sequence, given only the target position in the first frame. The CNNs used by correlation-filter-based trackers were originally designed for image classification, so the extracted feature representation of the target is not fully suited to visual tracking, and such deep trackers cannot run in real time. We take full advantage of the deeper VGG network to learn a more discriminative feature representation, equipped with a hierarchical feature fusion strategy and a lightweight residual channel attention mechanism for real-time tracking. Gao et al. [35] use a hierarchical attentional module with long short-term memory and multi-layer perceptrons to effectively emphasize visual patterns and learn a reinforced attentional representation for accurate target discrimination and localization. Building on these insights, this paper proposes a lightweight residual channel attention module that helps the backbone extract more discriminative features. Our expectation is that the network learns robust weighting coefficients during training, instead of requiring manual fine-tuning on small test datasets.
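As an illustration of the hierarchical feature fusion described above, the sketch below combines a shallow and a deep feature map from a torchvision VGG16; the chosen layers (conv4_3 and conv5_3), the bilinear upsampling, and the 1x1 fusion convolution are our own assumptions for illustration and may differ from the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg16

    class HierarchicalFeatureExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            features = vgg16().features               # VGG16 convolutional backbone
            self.shallow = features[:23]               # up to conv4_3 + ReLU (spatial detail)
            self.deep = features[23:30]                # up to conv5_3 + ReLU (semantics)
            self.fuse = nn.Conv2d(512 + 512, 256, kernel_size=1)   # reduce channels after concat

        def forward(self, x):
            s = self.shallow(x)                                    # shallow feature map
            d = self.deep(s)                                       # deeper, lower-resolution map
            d = F.interpolate(d, size=s.shape[-2:],
                              mode='bilinear', align_corners=False)  # align spatial sizes
            return self.fuse(torch.cat([s, d], dim=1))             # fused representation

    out = HierarchicalFeatureExtractor()(torch.randn(1, 3, 255, 255))
    print(out.shape)   # torch.Size([1, 256, 31, 31])

The fused map keeps the spatial resolution of the shallow layer while carrying the semantic cues of the deep layer, which is the trade-off the fusion strategy is aiming for.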

CHANNEL ATTENTION MECHANISM
EXPERIMENTAL RESULTS AND ANALYSIS
IMPLEMENTATION DETAILS
CONCLUSION