Abstract

We present a method that combines the similarity and semantic features of a target to improve tracking performance in video sequences. Trackers based on Siamese networks have achieved success on recent competitions and benchmarks by learning similarity from binary labels. Unfortunately, such weak labels limit the discriminative ability of the learned features, making it difficult to distinguish the target from distractors of the same class. We observe that inter-class semantic features help increase the separation between the target and the background, including distractors. We therefore propose a network architecture with both a similarity branch and a semantic branch, yielding more discriminative features for accurately locating the target in new frames. The network is trained on the large-scale ImageNet VID dataset. Even in the presence of background clutter, visual distortion, and distractors, the proposed method continues to track the target. We evaluate our method on the open benchmarks OTB and UAV123. The results show that the combined approach significantly improves tracking performance relative to trackers that use similarity or semantic features alone.
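The two-branch idea described above can be illustrated with a minimal sketch: each branch cross-correlates a template feature map against a search-region feature map, and the two response maps are fused before taking the peak as the predicted target location. Everything here is a hypothetical illustration, assuming a naive dense cross-correlation and a simple weighted-sum fusion with weight `alpha`; the paper's actual network layers and combination rule may differ.

```python
import numpy as np

def xcorr2d(search, template):
    """Naive 2D cross-correlation: slide the template over the search
    feature map and record the inner product at each offset."""
    H, W = search.shape
    h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[i:i + h, j:j + w] * template)
    return out

def fused_response(sim_search, sim_template, sem_search, sem_template,
                   alpha=0.5):
    """Fuse the similarity-branch and semantic-branch response maps by a
    weighted sum. `alpha` is a hypothetical fusion weight, not taken from
    the paper."""
    r_sim = xcorr2d(sim_search, sim_template)   # similarity branch
    r_sem = xcorr2d(sem_search, sem_template)   # semantic branch
    return alpha * r_sim + (1 - alpha) * r_sem

# Toy example: random feature maps standing in for CNN features.
rng = np.random.default_rng(0)
search_sim = rng.standard_normal((17, 17))
search_sem = rng.standard_normal((17, 17))
tmpl_sim = rng.standard_normal((6, 6))
tmpl_sem = rng.standard_normal((6, 6))

resp = fused_response(search_sim, tmpl_sim, search_sem, tmpl_sem)
# The predicted target location is the peak of the fused response map.
peak = np.unravel_index(np.argmax(resp), resp.shape)
```

The intuition captured here is that a distractor of the same class may score highly in the similarity branch alone, but the semantic branch contributes a complementary signal that can suppress it in the fused map.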
