Abstract
Text-based person retrieval is a fundamental task in computer vision that aims to retrieve the pedestrian image most relevant to a given textual description from a set of candidates. Such cross-modal retrieval is challenging because it requires selecting discriminative cues and performing proper cross-modal alignment. To achieve cross-modal alignment, most previous works focus on various inter-modal constraints while overlooking the influence of intra-modal noise, yielding sub-optimal retrieval results in certain cases. To this end, we propose a novel framework termed Multi-granularity Separation Network with Bidirectional Refinement Regularization (MSN-BRR) to tackle this problem. The framework consists of two components: (1) the Multi-granularity Separation Network, which extracts multi-grained discriminative textual and visual representations at local and global semantic levels, and (2) Bidirectional Refinement Regularization, which alleviates the influence of intra-modal noise and facilitates proper alignment between the visual and textual representations. Extensive experiments on two widely used benchmarks, i.e., CUHK-PEDES and ICFG-PEDES, show that our MSN-BRR method outperforms current state-of-the-art methods.
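The abstract does not specify how the two components are implemented, so the following is only a minimal, hypothetical PyTorch sketch of how such a design could be organized: a separation head that produces global and local embeddings from backbone token features, and a symmetric alignment loss whose targets are refined to soften noisy pairs. Every module name, dimension, and hyperparameter (e.g., `num_locals`, `tau`, `alpha`) is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two components named in the abstract.
# Nothing here is taken from the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularitySeparation(nn.Module):
    """Projects backbone token features into separate global and local embeddings."""
    def __init__(self, in_dim=768, embed_dim=256, num_locals=6):
        super().__init__()
        self.global_proj = nn.Linear(in_dim, embed_dim)  # global-level branch
        self.local_proj = nn.Linear(in_dim, embed_dim)   # shared local-level branch
        self.num_locals = num_locals

    def forward(self, tokens):
        # tokens: (batch, seq_len, in_dim) from a visual or textual backbone
        g = self.global_proj(tokens.mean(dim=1))         # global embedding
        # split the token sequence into part-level chunks as local embeddings
        chunks = tokens.chunk(self.num_locals, dim=1)
        l = torch.stack([self.local_proj(c.mean(dim=1)) for c in chunks], dim=1)
        return F.normalize(g, dim=-1), F.normalize(l, dim=-1)

def bidirectional_refinement_loss(img_emb, txt_emb, tau=0.07, alpha=0.6):
    """Symmetric InfoNCE-style loss whose one-hot targets are refined with the
    model's own (detached) similarity distribution, an assumed scheme for
    down-weighting noisy intra-modal pairs."""
    sim = img_emb @ txt_emb.t() / tau                    # (batch, batch) similarities
    hard = torch.eye(sim.size(0), device=sim.device)     # identity matching targets
    soft_i2t = alpha * hard + (1 - alpha) * sim.softmax(dim=1).detach()
    soft_t2i = alpha * hard + (1 - alpha) * sim.t().softmax(dim=1).detach()
    loss_i2t = -(soft_i2t * F.log_softmax(sim, dim=1)).sum(dim=1).mean()
    loss_t2i = -(soft_t2i * F.log_softmax(sim.t(), dim=1)).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```

In this sketch the loss is applied in both retrieval directions (image-to-text and text-to-image), which is one plausible reading of "bidirectional"; the actual refinement rule used by MSN-BRR is not described in the abstract.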