Text-and-Image Learning Transformer for Cross-modal Person Re-identification

Tinghui Wu, Shuhe Zhang, Dihu Chen, Haifeng Hu

doi:10.1145/3686160

Abstract

Text-based person re-identification aims to retrieve a target person from a large pedestrian gallery given a natural language description. Previous works mainly focus on embedding salient textual and visual representations in a common latent space using a dual-path structure or a parameter-shared network. However, they still struggle to extract fine-grained unimodal features and to fuse cross-modal data effectively, which increases the number of misaligned cases. To address these issues, we propose a text-and-image implicit learning Transformer (TILT) that eliminates textual anisotropy and enhances cross-modal alignment from both domains, based on bidirectional multimodal encoders. Specifically, we apply a pre-trained multimodal embedding module to overcome the unimodal anisotropy problem with contrastive learning, and map fine-grained features with a dual encoder under bidirectional masking. Then, we design a cross-modal interaction encoder that comprehensively mines implicit cross-modal relations by reconstructing masked tokens and fuses rich multimodal knowledge in a common space. In addition, a cross-modal similarity matching module is proposed to optimize intra-domain classification and decrease inter-domain divergence. Extensive experiments are conducted on three public benchmarks, CUHK-PEDES, ICFG-PEDES and RSTPReid, to verify the effectiveness of the proposed framework. The results show that our model outperforms state-of-the-art methods on all metrics.
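The abstract does not spell out the loss used for contrastive learning or how retrieval is scored, so the following is only a minimal sketch of one common formulation: a symmetric InfoNCE-style loss over paired text and image embeddings, plus a Rank-k check for text-to-image retrieval. It is not the TILT implementation. All function names, the temperature value of 0.07, and the convention that text i matches image i are assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(text_embs, image_embs):
    """Cosine similarity between every text and every image embedding."""
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in images] for t in texts]

def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Average of text-to-image and image-to-text InfoNCE terms.

    Assumes text_embs[i] and image_embs[i] describe the same person
    (hypothetical pairing convention, not taken from the paper).
    """
    sim = similarity_matrix(text_embs, image_embs)
    logits_t2i = [[s / temperature for s in row] for row in sim]
    logits_i2t = [list(col) for col in zip(*logits_t2i)]  # transpose
    return 0.5 * (_cross_entropy_diagonal(logits_t2i)
                  + _cross_entropy_diagonal(logits_i2t))

def rank_k_accuracy(sim, k=1):
    """Fraction of text queries whose matching image (same index) is in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top
    return hits / len(sim)

if __name__ == "__main__":
    # Toy 2-D embeddings: each text points in the same direction as its image.
    texts = [[1.0, 0.0], [0.0, 1.0]]
    images = [[2.0, 0.0], [0.0, 3.0]]
    print("loss:", symmetric_contrastive_loss(texts, images))
    print("rank-1:", rank_k_accuracy(similarity_matrix(texts, images), k=1))
```

In this sketch, well-aligned pairs give a loss close to zero, while swapping the images so that each text faces the wrong one drives the loss up and the Rank-1 score down, which is the misalignment the abstract describes.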

More From: ACM Transactions on Multimedia Computing, Communications, and Applications
