Abstract
Despite the great progress achieved, image-to-video person re-identification remains challenging because the query and the gallery come from different modalities. Current state-of-the-art approaches concentrate mainly on task-specific data and neglect the extra information available from different but related tasks. In this paper, we propose an end-to-end neural network framework for image-to-video person re-identification with cross-modal embeddings learned from such extra information. Concretely, cross-modal embedding layers from image captioning and video captioning models are incorporated to learn common latent embeddings for multiple modalities. The learned multimodal embeddings are expected to focus on a person’s prominent distinctions, because textual descriptions generally pay close attention to a person’s explicit characteristics. In addition, our framework uses CNNs and LSTMs to extract visual and spatiotemporal features, and combines the strengths of identification and verification models to improve the discriminative ability of the learned features. The experimental results demonstrate the effectiveness of our framework in narrowing the gap between heterogeneous data and show an observable improvement on the image-to-video person re-identification task.
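To make the combination of identification and verification objectives concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes a softmax cross-entropy term for identification, a contrastive term on the Euclidean distance between the image and video embeddings for verification, and a simple weighted sum of the two. The function names, the `margin`, and the `weight` parameter are hypothetical and chosen only for illustration.

```python
import math


def identification_loss(logits, label):
    """Softmax cross-entropy over identity classes (assumed form)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def verification_loss(feat_a, feat_b, same_identity, margin=1.0):
    """Contrastive loss on the Euclidean distance (assumed form)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    if same_identity:
        # Pull embeddings of the same person together.
        return 0.5 * dist ** 2
    # Push embeddings of different people at least `margin` apart.
    return 0.5 * max(0.0, margin - dist) ** 2


def combined_loss(image_logits, video_logits, image_label, video_label,
                  image_feat, video_feat, weight=1.0):
    """Identification terms for both branches plus a weighted verification term."""
    ident = (identification_loss(image_logits, image_label)
             + identification_loss(video_logits, video_label))
    verif = verification_loss(image_feat, video_feat,
                              image_label == video_label)
    return ident + weight * verif
```

In this sketch the identification terms encourage each branch to separate identities, while the verification term acts directly on the image-video pair, which is one plausible way to reduce the distance between the two modalities in the shared embedding space.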