Abstract

The goal of text-based person re-identification (Re-ID) is to retrieve the image of a target person given a textual description. However, owing to homogeneous variety and modality heterogeneity, it is challenging to simultaneously learn both global-level and local-level cross-modal features and align them in the same embedding space without additional networks. To address these problems, an effective multi-level cross-modality learning (MCL) framework for language-and-vision person Re-ID is proposed. Specifically, a multi-branch feature extraction (MFE) module is designed to map both global and partial semantic information into the visual and textual embeddings simultaneously, capturing intra-class semantic relationships at multiple granularities. In addition, a cross-modal alignment (CA) module is devised to match the multi-grained representations and reduce the inter-class gap from the global level to the partial level. Extensive experiments on the CUHK-PEDES and ICFG-PEDES datasets show that this method outperforms state-of-the-art models.
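As a rough illustration of the idea described above, the sketch below shows one possible way to embed global and part-level features from each modality into a shared space and align them with a symmetric contrastive loss. All module names, dimensions, the number of parts, and the InfoNCE-style objective are assumptions for illustration only; they are not the paper's actual MFE/CA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchEmbed(nn.Module):
    """Hypothetical multi-branch head: projects one modality's features into a
    shared space at a global level and at K part-level granularities."""
    def __init__(self, in_dim: int, embed_dim: int, num_parts: int = 6):
        super().__init__()
        self.global_fc = nn.Linear(in_dim, embed_dim)
        self.part_fcs = nn.ModuleList(
            nn.Linear(in_dim, embed_dim) for _ in range(num_parts)
        )

    def forward(self, global_feat, part_feats):
        # global_feat: (B, in_dim); part_feats: (B, K, in_dim)
        g = F.normalize(self.global_fc(global_feat), dim=-1)
        p = torch.stack(
            [F.normalize(fc(part_feats[:, k]), dim=-1)
             for k, fc in enumerate(self.part_fcs)],
            dim=1,
        )
        return g, p  # (B, D), (B, K, D)

def alignment_loss(img_g, txt_g, img_p, txt_p, tau: float = 0.02):
    """Stand-in cross-modal alignment: symmetric InfoNCE on global embeddings
    plus an averaged part-level term (an assumed objective, not the CA module
    as defined in the paper)."""
    def info_nce(a, b):
        logits = a @ b.t() / tau                       # (B, B) similarities
        labels = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    loss = info_nce(img_g, txt_g)
    num_parts = img_p.size(1)
    for k in range(num_parts):
        loss = loss + info_nce(img_p[:, k], txt_p[:, k]) / num_parts
    return loss
```

In this sketch, global and part-level branches share the same embedding dimension so that image and text representations can be compared directly at every granularity, which is one straightforward way to avoid extra alignment networks.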
