Abstract

AbstractThe image‐to‐video (I2V) person re‐identification (Re‐ID) is a cross‐modality pedestrian retrieval task, whose crux is to reduce the large modality discrepancy between images and videos. To this end, this paper proposes to predict the following video frames from a single image. Thus, the I2V person Re‐ID can be transformed to video‐to‐video (V2V) Re‐ID. Considering that predicting video frames from a single image is an ill‐posed problem, this paper proposes two strategies to improve the quality of the predicted videos. First, a pose‐guided video prediction pipeline is proposed. The given single image and pedestrian pose are encoded via image encoder and pose encoder, respectively; then, the image feature and pose feature are concatenated as the input of the video decoder. The authors minimize the difference between the predicted video and true video, and simultaneously minimize the difference between the true pose and predicted pose. Second, the conditional adversarial training strategy is employed to generate high‐quality video frames. Specifically, the discriminator takes the source image as condition and distinguishes whether the input frames are fake or true following frames of the source image. Experimental results demonstrate that the pose‐guided adversarial video prediction can effectively improve accuracy of I2V Re‐ID.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call