Abstract
Anchor-free one-shot models, which localize detections and extract embeddings by estimating center points in a single network, have proven highly effective in multi-object tracking (MOT). However, incomplete or unclear object appearances make the existing semantic feature aggregation in one-shot models less effective, which degrades MOT performance. Moreover, these one-shot MOT models often produce wrong matches between detections and objects because they ignore the influence of historical tracklet clues on objects. Motivated by these issues, we propose a novel hierarchical context-guided network (HCgNet) for one-shot MOT, which performs detection, embedding extraction, and object refinement through hierarchical global-wise, patch-wise, and object-wise processing. Specifically, our method learns temporal and spatial context features in a global-wise and patch-wise manner to guide the multi-scale aggregation, so as to locate the area of interest and extract rich embeddings. In this way, the embedding of each detection carries context relations in addition to semantic information, which reduces the loss of important information for tracked objects. Finally, based on the learned context features, a context-guided object refinement module is designed to learn the tracklet embedding and produce refined objects in each frame, which alleviates erroneous matches between objects and detections. Extensive experiments on several benchmarks, including the 2D MOT2015, MOT17, and MOT20 datasets, demonstrate the effectiveness of our HCgNet.
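To make the role of historical tracklet clues concrete, the following is a minimal, generic sketch of embedding-based association. It is not the HCgNet refinement module, whose details are not given in the abstract. It assumes a tracklet embedding kept as an exponential moving average of past detection embeddings and a greedy cosine-similarity matcher; the function names `update_tracklet` and `greedy_match` and the parameters `momentum` and `min_sim` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def update_tracklet(track_emb, det_emb, momentum=0.9):
    """Blend a new detection embedding into the tracklet's running embedding.

    Assumption: an exponential moving average stands in for a learned
    tracklet embedding.
    """
    return [momentum * t + (1.0 - momentum) * d for t, d in zip(track_emb, det_emb)]


def greedy_match(track_embs, det_embs, min_sim=0.5):
    """Pair tracklets with detections, most similar pairs first.

    Returns a list of (track_index, detection_index) tuples. Pairs whose
    similarity falls below min_sim are left unmatched.
    """
    pairs = sorted(
        (
            (cosine(t, d), ti, di)
            for ti, t in enumerate(track_embs)
            for di, d in enumerate(det_embs)
        ),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for sim, ti, di in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if ti in used_tracks or di in used_dets:
            continue
        used_tracks.add(ti)
        used_dets.add(di)
        matches.append((ti, di))
    return matches


if __name__ == "__main__":
    tracks = [[1.0, 0.0], [0.0, 1.0]]
    detections = [[0.1, 0.9], [0.9, 0.1]]
    for ti, di in greedy_match(tracks, detections):
        tracks[ti] = update_tracklet(tracks[ti], detections[di])
    print(tracks)
```

Matching against a running tracklet embedding rather than only the previous frame's detection is one simple way to keep a briefly occluded or blurred object from being matched to the wrong detection, which is the failure mode the abstract attributes to ignoring tracklet history.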