End-to-end Network Embedding Unsupervised Key Frame Extraction for Video-based Person Re-identification

Ye Li, Xiaoyu Luo, Guangqiang Yin, Chao Li, Shaoqi Hou

doi:10.1109/icist52614.2021.9440586

Abstract

In video-based person re-identification, input sequences currently contain only subtle frame-to-frame differences and a large amount of redundancy, because frame sequences are extracted with little effective intervention. Although some studies have noted that key frames should be extracted first, they have not combined key frame extraction with person re-identification. Consequently, it is difficult to evaluate whether the extracted key frames are effective for person re-identification. In this paper, we introduce an End-to-end Network Embedding Unsupervised Key Frame Extraction (EKEN) to address these problems. First, we design a key frame extraction module and train it with pseudo labels generated by hierarchical clustering. Second, we embed the key frame extraction module into the person re-identification task, so that the results of key frame extraction and person re-identification are fed back to each other promptly. This immediate feedback promotes the synchronized optimization of the two modules. On the MARS dataset, our method improves mAP by 0.7%, 2.9%, 2.1% and 2.3% over the methods based on Random, Evenly, Cluster and Frame difference, respectively. In particular, our method is better suited to real-world applications than existing methods.
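To make the pseudo-label idea concrete, the following is a minimal sketch of how hierarchical clustering could assign pseudo labels to the frames of one sequence and how one representative frame per cluster could then be chosen. It is not the EKEN module itself: the average-linkage criterion, the Euclidean distance, the nearest-to-centroid selection rule, the toy 2-D features, and the function names `hierarchical_pseudo_labels` and `pick_key_frames` are all assumptions made for illustration and are not taken from the abstract.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_pseudo_labels(features, num_clusters):
    """Agglomerative clustering (average linkage, an assumed choice).

    Starts with one cluster per frame and repeatedly merges the two
    closest clusters until `num_clusters` remain. Returns one integer
    pseudo label per frame.
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(euclidean(features[a], features[b])
                            for a in clusters[i] for b in clusters[j])
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


def pick_key_frames(features, labels):
    """Pick, per cluster, the frame nearest the cluster mean (assumed rule)."""
    dim = len(features[0])
    key_frames = []
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        centroid = [sum(features[i][k] for i in members) / len(members)
                    for k in range(dim)]
        key_frames.append(
            min(members, key=lambda i: euclidean(features[i], centroid)))
    return sorted(key_frames)


# Toy per-frame features (hypothetical); real features would come from a CNN.
frame_features = [
    [0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
    [10.0, 10.0], [11.0, 10.0], [12.0, 10.0],
    [30.0, 0.0],
]
labels = hierarchical_pseudo_labels(frame_features, num_clusters=3)
key_frames = pick_key_frames(frame_features, labels)
print(labels)      # one pseudo label per frame
print(key_frames)  # indices of the selected key frames
```

In this sketch the pseudo labels only group visually similar frames. In the method described above, such labels would instead supervise a trainable key frame extraction module that is optimized jointly with the re-identification network.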

Full Text