Abstract

Text-based person retrieval is a fundamental task in computer vision that aims to retrieve the most relevant pedestrian image from a set of candidates according to a textual description. Such cross-modal retrieval is challenging because it requires selecting distinguishing cues and performing proper cross-modal alignment. To achieve cross-modal alignment, most previous works focus on various inter-modal constraints while overlooking the influence of intra-modal noise, yielding sub-optimal retrieval results in certain cases. To this end, we propose a novel framework, termed Multi-granularity Separation Network with Bidirectional Refinement Regularization (MSN-BRR), to tackle this problem. The framework consists of two components: (1) a Multi-granularity Separation Network, which extracts multi-grained discriminative textual and visual representations at local and global semantic levels; and (2) Bidirectional Refinement Regularization, which alleviates the influence of intra-modal noise and facilitates proper alignment between the visual and textual representations. Extensive experiments on two widely used benchmarks, i.e., CUHK-PEDES and ICFG-PEDES, show that our MSN-BRR method outperforms current state-of-the-art methods.
