An Empirical Study of CLIP for Text-Based Person Search

Min Cao, Yang Bai, Mang Ye, Min Zhang, Ziyin Zeng

doi:10.1609/aaai.v38i1.27801

Abstract

Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions. Recently, Contrastive Language-Image Pre-training (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably well on various cross-modal downstream tasks thanks to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, has likewise seen a rise in CLIP-based research. To explore the potential of the vision-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. With these designs and practical training tricks, the model attains satisfactory performance without any sophisticated modules. We also conduct probing experiments of TBPS-CLIP on model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and inform future CLIP-based TBPS research.

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 3
