Abstract

The core difficulty of text-based person search is achieving fine-grained alignment between visual and linguistic data, so as to bridge the modality heterogeneity gap. Most existing works on this task focus on global and local feature extraction and matching, ignoring the importance of relational information. This paper proposes a new text-based person search model, named CM-LRGNet, which extracts Cross-Modal Local-Relational-Global features in an end-to-end manner and performs fine-grained cross-modal alignment at these three feature levels. Concretely, we first split the convolutional feature maps to obtain local features of images, and adaptively extract textual local features. Then a relation encoding module is proposed to implicitly learn the relational information implied in the images and texts. Finally, a relation-aware graph attention network is designed to fuse the local and relational features into global representations for both images and text queries. Extensive experimental results on the benchmark CUHK-PEDES dataset show that our approach achieves state-of-the-art performance (64.18%, 82.97%, and 89.85% Top-1, Top-5, and Top-10 accuracies, respectively) by learning and aligning local-relational-global representations from different modalities. Our code has been released at https://github.com/zhangweifeng1218/Text-based-Person-Search.
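
The sketch below is only a rough illustration of the local-relational-global pipeline described above; it is not the authors' released code. The stripe-based splitting of the visual feature map, the use of multi-head self-attention as the relation encoder, the single attention-weighted fusion layer standing in for the relation-aware graph attention network, and all dimensions and module names are assumptions made for illustration.

```python
# Minimal sketch (assumptions only) of a local -> relational -> global encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalRelationalGlobal(nn.Module):
    def __init__(self, in_channels=2048, dim=512, num_stripes=6, heads=8):
        super().__init__()
        self.num_stripes = num_stripes
        self.proj = nn.Linear(in_channels, dim)           # project pooled local features
        self.relation = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse_score = nn.Linear(2 * dim, 1)           # attention score for fusion

    def forward(self, fmap):                              # fmap: (B, C, H, W), H divisible by num_stripes
        B, C, H, W = fmap.shape
        # 1) Local: split the feature map into horizontal stripes and pool each stripe.
        stripes = fmap.view(B, C, self.num_stripes, H // self.num_stripes, W)
        local = stripes.mean(dim=(3, 4)).transpose(1, 2)  # (B, K, C)
        local = self.proj(local)                          # (B, K, dim)
        # 2) Relational: self-attention over local features encodes pairwise relations.
        rel, _ = self.relation(local, local, local)       # (B, K, dim)
        # 3) Global: attention-weighted fusion of local and relational nodes.
        nodes = torch.cat([local, rel], dim=1)            # (B, 2K, dim)
        ctx = nodes.mean(dim=1, keepdim=True).expand_as(nodes)
        alpha = torch.softmax(self.fuse_score(torch.cat([nodes, ctx], dim=-1)), dim=1)
        global_feat = (alpha * nodes).sum(dim=1)          # (B, dim)
        return local, rel, F.normalize(global_feat, dim=-1)
```

A matching text branch would produce local, relational, and global embeddings of the same dimension so that the two modalities can be aligned at each of the three levels with a standard cross-modal matching loss.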
