Abstract

Text-based person re-identification aims to retrieve images of a target person from a large visual database given a natural language description. For local visual feature extraction, most state-of-the-art methods adopt either a rigid uniform partitioning strategy, which can be too coarse to capture local details, or pre-processing with external cues, which may suffer from the biases of the pre-trained model and high computational cost. In this paper, we propose an Adversarial Self-aligned Part Detecting Network (ASPD-Net) that extracts and combines multi-granular visual and textual features. A novel Self-aligned Part Mask Module is presented that autonomously learns human body-part information and obtains local visual features in a soft-attention manner through K Self-aligned Part Mask Detectors. Regarding the main model branches as a generator, a discriminator is employed to determine whether a representation vector comes from the visual or the textual modality. Trained with an adversarial loss, ASPD-Net learns more robust, modality-invariant representations by attempting to fool the discriminator. Experimental results demonstrate that the proposed ASPD-Net outperforms previous methods and achieves state-of-the-art performance on the CUHK-PEDES and RSTPReid datasets.
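The two core mechanisms described above can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the authors' implementation: the module names, feature dimensions, number of parts K, and the exact form of the adversarial loss are assumptions, since the abstract does not specify them.

```python
# Hypothetical sketch of K soft-attention part-mask detectors and a
# modality discriminator used for adversarial training (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAlignedPartMasks(nn.Module):
    """Produce K soft-attention masks over a visual feature map and pool
    them into K local part features (one per detector)."""
    def __init__(self, in_channels: int, num_parts: int = 6):
        super().__init__()
        # A 1x1 convolution with K output channels acts as K mask detectors.
        self.mask_conv = nn.Conv2d(in_channels, num_parts, kernel_size=1)

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W)
        masks = self.mask_conv(feat_map)                 # (B, K, H, W)
        masks = F.softmax(masks.flatten(2), dim=-1)      # soft attention over H*W
        feats = feat_map.flatten(2)                      # (B, C, H*W)
        # Each part feature is a mask-weighted sum of spatial positions.
        part_feats = torch.einsum('bkn,bcn->bkc', masks, feats)
        return part_feats                                # (B, K, C)

class ModalityDiscriminator(nn.Module):
    """Predict whether an embedding comes from the visual or textual branch."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                 nn.Linear(dim // 2, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Logit: > 0 interpreted as "visual", < 0 as "textual" (assumed convention).
        return self.net(emb)

def adversarial_losses(disc, vis_emb, txt_emb):
    """Discriminator tries to separate the modalities; the encoders
    (the generator) are trained to fool it, encouraging a shared space."""
    vis_logit, txt_logit = disc(vis_emb), disc(txt_emb)
    ones, zeros = torch.ones_like(vis_logit), torch.zeros_like(txt_logit)
    d_loss = (F.binary_cross_entropy_with_logits(vis_logit, ones)
              + F.binary_cross_entropy_with_logits(txt_logit, zeros))
    # Generator objective: swap the labels so the discriminator is fooled.
    g_loss = (F.binary_cross_entropy_with_logits(vis_logit, zeros)
              + F.binary_cross_entropy_with_logits(txt_logit, ones))
    return d_loss, g_loss
```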
