Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval

Yu Liu, Zhiyong Cheng, Haipeng Chen, Guihe Qin, Xun Yang

doi:10.1609/aaai.v38i12.29314

Abstract

Text-based Person Retrieval (TPR) aims to retrieve relevant images of specific pedestrians given a textual query. Mainstream approaches primarily leverage pretrained deep neural networks to map the visual and textual modalities into a common latent space for cross-modality matching. Despite their remarkable achievements, existing efforts mainly focus on learning the statistical cross-modality correlation found in the training data rather than the intrinsic causal correlation. As a result, they often struggle to retrieve accurately under environmental changes such as illumination, pose, and occlusion, or when encountering images with similar attributes. In this regard, we pioneer a causal view of TPR. Specifically, we assume that each image is composed of a mixture of causal factors (which are semantically consistent with the text descriptions) and non-causal factors (which are retrieval-irrelevant, e.g., background), and that only the former can lead to reliable retrieval judgments. Our goal is to extract text-critical, robust visual representations (i.e., causal factors) and establish domain-invariant cross-modality correlations for accurate and reliable retrieval. However, causal and non-causal factors are unobserved, so we emphasize that ideal causal factors that can simulate causal scenes should satisfy two basic principles: 1) Independence: being independent of non-causal factors, and 2) Sufficiency: being causally sufficient for TPR across different environments. Building on that, we propose an Invariant Representation Learning method for TPR (IRLT) that constrains the visual representations to satisfy these two properties. Extensive experiments on three datasets clearly demonstrate the advantages of IRLT over leading baselines in terms of accuracy and generalization.
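As a rough illustration of two ideas named in the abstract, the sketch below ranks gallery images by cosine similarity to a text query in a shared latent space, and computes a simple decorrelation penalty between "causal" and "non-causal" feature groups. It is not the IRLT method: the abstract does not specify the losses or architecture, so the function names (`rank_gallery`, `cross_correlation_penalty`), the use of cosine similarity, and the use of squared Pearson correlation as a stand-in for independence are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(text_emb, image_embs):
    """Return gallery indices ordered from most to least similar to the query.

    Assumes text and image embeddings already live in one common latent space.
    """
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


def _pearson(x, y):
    """Pearson correlation of two sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cross_correlation_penalty(causal, non_causal):
    """Sum of squared correlations between causal and non-causal dimensions.

    `causal` and `non_causal` are batches (lists of feature vectors) with the
    same number of rows. The penalty is zero when the two groups are linearly
    uncorrelated over the batch. This is only a hypothetical proxy for the
    Independence principle, since uncorrelated does not imply independent.
    """
    causal_cols = list(zip(*causal))
    non_causal_cols = list(zip(*non_causal))
    return sum(_pearson(c, n) ** 2 for c in causal_cols for n in non_causal_cols)


if __name__ == "__main__":
    query = [1.0, 0.0]
    gallery = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
    print(rank_gallery(query, gallery))

    causal_batch = [[1.0], [2.0], [3.0]]
    nuisance_batch = [[2.0], [4.0], [6.0]]
    print(cross_correlation_penalty(causal_batch, nuisance_batch))
```

In a training setup, a penalty of this kind would typically be added to a matching loss so that the features used for retrieval carry as little background information as possible; how IRLT actually enforces Independence and Sufficiency is described in the full paper, not in this abstract.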

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 2

Similar Papers

Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification
Zhiwei Zhao ... Bin Liu
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024

Learning the structure of image collections with latent aspect models
01 Jan 2007

DSG-GAN: Multi-turn text-to-image synthesis via dual semantic-stream guidance with global and local linguistics
Heyu Sun ... Qiang Guo
Intelligent Systems with Applications | VOL. 20
30 Aug 2023

AMECON
Ines Chami ... Hervé Le Borgne
06 Jun 2017
