The tracking-by-detection paradigm currently dominates multiple target tracking algorithms. It usually includes three tasks: target detection, appearance feature embedding, and data association. Carrying out these three tasks successively usually leads to lower tracking efficiency. In this paper, we propose a one-stage anchor-free multiple task learning framework which carries out target detection and appearance feature embedding in parallel to substantially increase the tracking speed. This framework simultaneously predicts a target detection and produces a feature embedding for each location, by sharing a pyramid of feature maps. We propose a deformable local attention module which utilizes the correlations between features at different locations within a target to obtain more discriminative features. We further propose a task-aware prediction module which utilizes deformable convolutions to select the most suitable locations for the different tasks. At the selected locations, classification of samples into foreground or background, appearance feature embedding, and target box regression are carried out. Two effective training strategies, regression range overlapping and sample reweighting, are proposed to reduce missed detections in dense scenes. Ambiguous samples whose identities are difficult to determine are effectively dealt with to obtain more accurate feature embedding of target appearance. An appearance-enhanced non-maximum suppression is proposed to reduce over-suppression of true targets in crowded scenes. Based on the one-stage anchor-free network with the deformable local attention module and the task-aware prediction module, we implement a new online multiple target tracker. Experimental results show that our tracker achieves a very fast speed while maintaining a high tracking accuracy.
Read full abstract