Abstract

Convolutional neural network (CNN) based trackers achieve excellent tracking accuracy and speed. Feature extraction from the target template and the search region is a key part of visual tracking. Recent feature subnetworks combine CNNs with channel attention for feature extraction. However, some of these subnetworks do not make full use of target location dependencies, so target location dependency information is lost in the extracted target features. In this work, we design a novel feature extraction subnetwork with local temporal adaptive modules that produce location-sensitive importance maps, which effectively capture diverse motion information and highlight target location information. The subnetwork fully utilizes target location dependency information to obtain more accurate target locations in both the target template and the search region, and it also fully exploits local temporal semantics. Furthermore, we learn an interactive module in the template branch that captures non-linear cross-channel interactions and channel-wise dependencies by combining each channel with its k neighbors. This interactive module adds only a small computational overhead and is lightweight compared with other attention modules for visual tracking. We propose a novel tracking framework that mainly comprises the designed feature extraction subnetwork and the interactive learning module. We evaluate the proposed tracker against advanced trackers on the GOT-10k, UAV123, DTB70, NFS, OTB-100, VOT2018, LaSOT, and VOT-RGBT2019 benchmarks, where it achieves leading performance at a tracking speed of 60 FPS.
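
The abstract does not specify the exact form of the interactive module, so the following is only a minimal sketch of one common way to combine each channel with its k neighbors: pool each channel to a scalar, mix each scalar with its neighbors through a 1D kernel of length k, and pass the result through a sigmoid to obtain per-channel weights. The function names, the zero padding at the channel boundaries, and the use of global average pooling are assumptions made for illustration, not details taken from the paper.

```python
import math


def channel_interaction_weights(features, kernel):
    """Hypothetical helper: per-channel weights from local cross-channel mixing.

    features: list of C channels, each a non-empty H x W list of lists of floats.
    kernel:   list of k floats (k odd), shared across all channels.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length k must be odd")
    pad = k // 2

    # Global average pooling: one scalar descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]

    num_channels = len(pooled)
    weights = []
    for i in range(num_channels):
        acc = 0.0
        for j in range(k):
            idx = i + j - pad
            # Assumed zero padding: neighbors outside the channel range are skipped.
            if 0 <= idx < num_channels:
                acc += kernel[j] * pooled[idx]
        # The sigmoid makes the interaction non-linear and bounds weights to (0, 1).
        weights.append(1.0 / (1.0 + math.exp(-acc)))
    return weights


def apply_channel_weights(features, weights):
    """Rescale every value of each channel by that channel's weight."""
    return [
        [[w * value for value in row] for row in channel]
        for channel, w in zip(features, weights)
    ]


# Usage: three 1 x 2 channels mixed with a kernel of length k = 3.
features = [[[1.0, 3.0]], [[0.0, 0.0]], [[2.0, 2.0]]]
weights = channel_interaction_weights(features, [1.0, 1.0, 1.0])
scaled = apply_channel_weights(features, weights)
```

Under these assumptions the module needs only k learnable parameters regardless of the number of channels, which is consistent with the abstract's description of it as lightweight.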
