Abstract

Discriminative correlation filters (DCF) have drawn increasing interest in visual tracking. In particular, a few recent works treat DCF as a special layer and add it to a Siamese network for visual tracking. However, most of them adopt shallow networks to learn target representations; these lack the robust semantic information of deeper layers and therefore fail to handle significant appearance changes. In this paper, we design a novel Siamese network that fuses high-level semantic features and low-level spatial detail features for correlation tracking. Specifically, we design a residual semantic embedding module that adaptively draws semantic information from high-level features into low-level features to guide the feature fusion. Furthermore, we adopt an efficient channel attention mechanism to filter out noise and make the network focus on features that are beneficial for visual tracking. The overall architecture is trained end-to-end offline to adaptively learn target representations that not only encode high-level semantic features and low-level spatial detail features but are also closely coupled to correlation filters. Experimental results on the widely used OTB2013, OTB2015, VOT2016, TC-128, and UAV123 benchmarks show that our proposed tracker performs favorably against several state-of-the-art trackers.
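The abstract does not say which channel attention design is used, so the following is only a minimal sketch of one common form, a squeeze-and-excitation style gate, written in NumPy. The function name `channel_attention`, the weight matrices `w1` and `w2`, and the reduction ratio `r` are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style gate (an assumed design, not the paper's).

    features: array of shape (C, H, W)
    w1:       bottleneck weights of shape (C // r, C)
    w2:       expansion weights of shape (C, C // r)
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(1, 2))
    # Excite: bottleneck with ReLU, then a sigmoid gate in (0, 1).
    hidden = np.maximum(w1 @ squeezed, 0.0)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Rescale each channel by its gate value.
    return features * weights[:, None, None], weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4
features = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
reweighted, weights = channel_attention(features, w1, w2)
```

In a trained network the gate weights would be learned, so channels that carry mostly noise can be scaled toward zero while informative channels are kept close to their original magnitude.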

Highlights

  • Visual object tracking is a fundamental research topic in computer vision and plays an important role in applications such as vehicle navigation, robotics, and surveillance.

  • Trackers based on discriminative correlation filter (DCF) methods [1], [2] have received significant attention due to their state-of-the-art performance and high tracking speed.

  • We propose a residual semantic embedding (RSE) module that adaptively introduces semantic information into low-level features, which reduces the gap in semantic level and spatial resolution and enhances the fusion of low-level and high-level features.

Summary

INTRODUCTION

Visual object tracking is a fundamental research topic in computer vision and plays an important role in applications such as vehicle navigation, robotics, and surveillance. A later work [64] adopts an anchor-free strategy to predict object bounding boxes, and fuses low-level and high-level features by concatenating multi-layer deep features along the channel dimension for tracking. Although these trackers achieve state-of-the-art performance, the lack of online learning makes it hard for them to adapt to appearance variations of the target. We construct a novel Siamese network and combine it with a correlation filter (CF) layer to learn, end to end, the fusion of high-level semantic features and low-level detail features for visual tracking. We design a residual semantic embedding module and integrate it into the Siamese network; the module adaptively draws on semantic information from high-level features to guide the fusion of high-level semantic features and low-level spatial details.
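To make the role of the correlation filter concrete, here is a minimal sketch of the classical single-channel DCF, solved in closed form in the Fourier domain as a ridge regression. It is not the paper's CF layer, which operates on learned multi-channel features inside a network. The helper names (`gaussian_label`, `train_filter`, `respond`), the regularization value `lam`, and the random template are assumptions made for illustration.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired response: a Gaussian peak at the patch center."""
    h, w = shape
    ys = np.arange(h)[:, None] - h // 2
    xs = np.arange(w)[None, :] - w // 2
    return np.exp(-(ys ** 2 + xs ** 2) / (2.0 * sigma ** 2))

def train_filter(x, y, lam=1e-4):
    """Closed-form ridge regression solution, returned as the conjugate filter."""
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def respond(h_conj, z):
    """Correlation response of the trained filter on a search patch z."""
    return np.real(np.fft.ifft2(np.fft.fft2(z) * h_conj))

# Toy data: a random template and a circularly shifted copy as the search patch.
rng = np.random.default_rng(0)
template = rng.standard_normal((32, 32))
label = gaussian_label(template.shape)
h_conj = train_filter(template, label)

search = np.roll(template, shift=(3, 5), axis=(0, 1))
response = respond(h_conj, search)

# The response peak moves away from the center by the target displacement.
peak = np.unravel_index(np.argmax(response), response.shape)
offset = (int(peak[0]) - 16, int(peak[1]) - 16)
```

Because both training and detection reduce to element-wise operations on FFTs, the filter can be re-estimated cheaply on each new frame, which is the online adaptation that purely offline Siamese trackers lack.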

RELATED WORK
DCF AND CORRELATION FILTER LAYER
CORRELATION NETWORK ARCHITECTURE
ONLINE TRACKING
Findings
CONCLUSION