Abstract

Video person re-identification has attracted much attention in recent years. It aims to match image sequences of pedestrians across different camera views. Previous approaches usually improve this task from three aspects: 1) selecting more discriminative frames; 2) generating more informative temporal representations; and 3) developing more effective distance metrics. To jointly address these three aspects, we present a novel and practical deep architecture for video person re-identification termed the self-and-collaborative attention network (SCAN), which takes a pair of videos as input and outputs their matching score. SCAN has several appealing properties. First, SCAN adopts a non-parametric attention mechanism to refine the intra-sequence and inter-sequence feature representations of videos and outputs a self-and-collaborative feature representation for each video, aligning the discriminative frames between the probe and gallery sequences. Second, going beyond existing models, a generalized pairwise similarity measurement is proposed to generate the similarity feature representation of a video pair by calculating the Hadamard product of their self-representation difference and collaborative-representation difference; the matching result can then be predicted by a binary classifier. Third, a dense clip segmentation strategy is introduced to generate rich probe-gallery pairs for optimizing the model. In the test phase, the final matching score of two videos is determined by averaging the scores of the top-ranked clip pairs. Extensive experiments demonstrate the effectiveness of SCAN, which outperforms the best-performing baselines in top-1 accuracy on the iLIDS-VID, PRID2011, and MARS datasets.
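
To make the described pipeline concrete, below is a minimal PyTorch sketch of the non-parametric attention pooling and the generalized pairwise similarity feature. It is an illustration under assumptions, not the paper's exact formulation: the scaled dot-product weighting, the mean pooling, the feature dimension `D = 2048`, and the helper names `self_attention_pool`, `collaborative_attention_pool`, and `pair_similarity_feature` are all hypothetical.

```python
import torch
import torch.nn.functional as F

def self_attention_pool(x):
    # x: (T, D) frame features of one video clip.
    # Non-parametric: attention weights come from frame-to-frame dot
    # products within the SAME sequence; no learned parameters.
    attn = F.softmax(x @ x.t() / x.size(1) ** 0.5, dim=1)  # (T, T)
    return (attn @ x).mean(dim=0)                          # (D,)

def collaborative_attention_pool(x, y):
    # Refine x's frames using the OTHER sequence y as the query, so
    # frames of x that agree with frames of y dominate the pooling.
    attn = F.softmax(y @ x.t() / x.size(1) ** 0.5, dim=1)  # (T_y, T_x)
    return (attn @ x).mean(dim=0)                          # (D,)

def pair_similarity_feature(p, g):
    # Generalized pairwise similarity: Hadamard product of the
    # self-representation difference and the collaborative-
    # representation difference of a probe/gallery clip pair.
    sp, sg = self_attention_pool(p), self_attention_pool(g)
    cp = collaborative_attention_pool(p, g)  # probe refined by gallery
    cg = collaborative_attention_pool(g, p)  # gallery refined by probe
    return (sp - sg) * (cp - cg)             # (D,) similarity feature

# A binary classifier maps the similarity feature to match / non-match.
D = 2048  # assumed frame-feature dimension of the CNN backbone
classifier = torch.nn.Linear(D, 2)
```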
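The test-time protocol can be sketched in the same spirit: dense clip segmentation yields several clips per video, every probe clip is scored against every gallery clip, and the video-level score averages the top-ranked clip-pair probabilities. The number of retained pairs `k` is an assumed hyperparameter, and `match_score` is a hypothetical helper built on the sketch above.

```python
def match_score(probe_clips, gallery_clips, classifier, k=5):
    # Score every probe-clip / gallery-clip pair, then average the
    # top-k matching probabilities as the final video-level score.
    scores = []
    for p in probe_clips:
        for g in gallery_clips:
            feat = pair_similarity_feature(p, g)
            scores.append(F.softmax(classifier(feat), dim=0)[1])  # P(match)
    scores = torch.stack(scores)
    return scores.topk(min(k, scores.numel())).values.mean()
```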
