Abstract

Vehicle tracking in unmanned aerial vehicle (UAV) videos is a fundamental yet vital computer vision task. It mainly consists of two key components: detection and reidentification (ReID). Recently, one-shot trackers, which integrate detection and ReID in a unified network, have received significant attention for their fast tracking speed. However, existing one-shot trackers typically rely on local information to distinguish detected targets. Lacking global relations, which are key cues for tracking, these methods struggle to identify targets in UAV videos accurately. To alleviate this issue, in this letter we design a ReID head that combines nonlocal blocks and a transformer layer to capture global semantic relations. First, we propose a novel pyramid fusion network (PFN) to obtain the pixel-wise relations of features at multiple levels and aggregate them into features with richer semantic information. Then, we present a channel-wise transformer enhancer (CTE) to model the dependencies among the channels of the feature map and predict fine-grained identity embeddings. Extensive experiments on the VisDrone2021 and UAVDT benchmarks demonstrate that our tracker, namely global context embedding for vehicle tracking (GCEVT), achieves state-of-the-art tracking performance.
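The abstract does not give the exact architecture of the CTE, but its core idea, modeling dependencies among the channels of a feature map via attention, can be illustrated with a minimal sketch. In the sketch below, each channel of the flattened feature map is treated as a token, so the attention matrix has shape (C, C) and re-weights channels by their pairwise similarity. All names (`channel_attention`) and the random projection weights are illustrative assumptions, not the paper's implementation, where such projections would be learned.

```python
import numpy as np

def channel_attention(x, d=16, seed=0):
    """Illustrative channel-wise self-attention (a sketch, not the paper's CTE).

    x : feature map flattened to shape (C, H*W); each row (channel) is a token.
    d : dimension of the query/key projections.

    The projection weights are random here purely for demonstration;
    in a real network they are learned parameters.
    """
    rng = np.random.default_rng(seed)
    c, n = x.shape
    wq = rng.standard_normal((n, d)) / np.sqrt(n)  # query projection
    wk = rng.standard_normal((n, d)) / np.sqrt(n)  # key projection
    wv = rng.standard_normal((n, n)) / np.sqrt(n)  # value projection
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                  # (C, C): channel-to-channel relations
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v                                # channels re-weighted by global context

# Toy usage: 8 channels of an 8x8 feature map.
x = np.random.default_rng(1).standard_normal((8, 64))
y = channel_attention(x)
print(y.shape)  # same shape as the input: (8, 64)
```

Because attention is computed across channels rather than spatial positions, every output channel aggregates information from all others, which is one way to inject the global context the abstract argues local-only ReID heads lack.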
