Abstract

An important cue for multi-pedestrian tracking in video is that an individual's appearance remains consistent over time. In this paper, we address multi-pedestrian tracking by learning a robust appearance model within the tracking-by-detection paradigm. To separate detections of different pedestrians while assembling detections of the same pedestrian, we take advantage of this appearance-consistency cue and exploit three types of evidence, drawn from the recent, the past, and the near future. Existing online approaches exploit only the detection-to-detection and sequence-to-detection metrics, which capture recent and past appearance patterns respectively, while future pedestrian appearance is simply ignored. We remedy this drawback by further considering a sequence-to-sequence metric, which draws on near-future appearance representations. Adaptive combination weights are learned to fuse these three metrics. Moreover, we propose a novel Focal Triplet Loss that makes the model focus more on hard examples than on easy ones; we demonstrate that this significantly enhances the discriminative power of the model compared with treating every sample equally. The effectiveness and efficiency of the proposed method are verified through comprehensive ablation studies and comparisons with many competitive (offline/online/near-online) counterparts on the MOT16 and MOT17 Challenges.
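The abstract names a Focal Triplet Loss but does not give its formula. The sketch below shows one plausible form: the standard triplet hinge loss modulated by a focal-style factor in the spirit of the focal loss of Lin et al., so that hard triplets (large margin violations) dominate the gradient while easy triplets are down-weighted. The sigmoid-based difficulty weight and the parameter names (`margin`, `gamma`) are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def focal_triplet_loss(anchor, positive, negative, margin=0.3, gamma=2.0):
    """Triplet loss with a focal-style reweighting (illustrative sketch).

    anchor, positive, negative: (B, D) embedding tensors, where the
    positive shares the anchor's identity and the negative does not.
    """
    # Euclidean distances for anchor-positive and anchor-negative pairs.
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)

    # Standard hinge-style triplet loss per sample.
    base = F.relu(d_ap - d_an + margin)

    # Difficulty in (0, 1): near 0 for easy triplets (negative far
    # beyond the margin), above 0.5 once the margin is violated.
    difficulty = torch.sigmoid(d_ap - d_an + margin)

    # Focal modulation: easy triplets contribute little to the loss,
    # hard ones are emphasized, mirroring the abstract's stated goal.
    return (difficulty ** gamma * base).mean()
```

With `gamma = 0` the focal factor is constant and the loss reduces to an ordinary (scaled) triplet loss; increasing `gamma` sharpens the focus on hard examples, which is the behavior the abstract contrasts with treating every sample equally.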
