Text Query Research Articles

Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text. In recent years, TBPS has made remarkable progress, and state-of-the-art (SOTA) methods achieve superior performance by learning local fine-grained correspondence between images and texts. However, most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities, which is unreliable due to the lack of contextual information or the potential introduction of noise. Moreover, the existing methods seldom consider the information inequality problem between modalities caused by image-specific information. To address these limitations, we propose an efficient joint multilevel alignment network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels, and realize fast and effective person search. Specifically, we first design an image-specific information suppression (ISS) module, which suppresses image background and environmental factors by relation-guided localization (RGL) and channel attention filtration (CAF), respectively. This module effectively alleviates the information inequality problem and realizes the alignment of information volume between images and texts. Second, we propose an implicit local alignment (ILA) module to adaptively aggregate all pixel/word features of image/text to a set of modality-shared semantic topic centers and implicitly learn the local fine-grained correspondence between modalities without additional supervision and cross-modal interactions. Also, a global alignment (GA) is introduced as a supplement to the local perspective. The cooperation of global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of our MANet.
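The implicit local alignment idea described in this abstract (aggregating all pixel or word features onto a shared set of semantic topic centers) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product soft assignment, the cosine scoring, and the names `aggregate_to_centers` and `local_similarity` are assumptions made here for illustration, and the centers are fixed toy vectors rather than learned parameters.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_to_centers(tokens, centers):
    """Soft-assign every token (pixel or word feature) to the shared centers
    and return one weighted-average vector per center (hypothetical helper)."""
    dim = len(centers[0])
    # weights[i][k]: how strongly token i is assigned to center k.
    weights = [softmax([dot(t, c) for c in centers]) for t in tokens]
    aggregated = []
    for k in range(len(centers)):
        mass = sum(w[k] for w in weights)  # always > 0 because softmax is positive
        aggregated.append(
            [sum(w[k] * t[d] for w, t in zip(weights, tokens)) / mass
             for d in range(dim)]
        )
    return aggregated


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def local_similarity(image_tokens, text_tokens, centers):
    """Average cosine similarity between per-center image and text features.
    Both modalities use the same centers, so no cross-modal interaction
    between individual pixels and words is needed."""
    img = aggregate_to_centers(image_tokens, centers)
    txt = aggregate_to_centers(text_tokens, centers)
    return sum(cosine(i, t) for i, t in zip(img, txt)) / len(centers)
```

Because each modality is aggregated independently, gallery image features could in principle be computed once and reused across queries, which is consistent with the abstract's emphasis on fast search.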


Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
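The pair-mining step in this abstract (finding videos whose captions are similar) can be sketched as follows. The abstract does not define "similar", so this sketch assumes one possible criterion: two captions of equal length that differ in exactly one word. The function name `mine_caption_pairs` and the toy captions are hypothetical, and the language-model step that writes the modification text is not shown.

```python
from collections import defaultdict


def mine_caption_pairs(captions):
    """captions: dict mapping video id -> caption string.

    Returns a list of (id_a, id_b, word_a, word_b) tuples for captions that
    differ in exactly one word (an assumed similarity criterion)."""
    buckets = defaultdict(list)
    for video_id, caption in captions.items():
        words = caption.lower().split()
        for i, word in enumerate(words):
            # Key = caption with the word at position i masked out.
            key = (i, tuple(words[:i] + words[i + 1:]))
            buckets[key].append((video_id, word))

    pairs = []
    for items in buckets.values():
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (id_a, word_a), (id_b, word_b) = items[a], items[b]
                if word_a != word_b:  # skip identical captions
                    pairs.append((id_a, id_b, word_a, word_b))
    return pairs


if __name__ == "__main__":
    toy = {
        "v1": "a dog runs on grass",
        "v2": "a cat runs on grass",
        "v3": "a man cooks dinner",
    }
    print(mine_caption_pairs(toy))
```

Each mined pair supplies a query video, a target video, and the two differing words, from which a modification text could then be generated to complete a triplet.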



Articles published on Text Query

Minebot: Chatbot to Respond to Text Queries Pertaining to Various Acts, Rules, And Regulations Applicable to Mining Industries

Leaf Senescence Database v5.0: A Comprehensive Repository for Facilitating Plant Senescence Research

Searching the Genetic Programming Bibliography

Hierarchical matching and reasoning for multi-query image retrieval

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

Deep learning-based information retrieval with normalized dominant feature subset and weighted vector model

Improving Text-Based Person Retrieval by Excavating All-Round Information Beyond Color

Symbolic Music Generation From Graph-Learning-Based Preference Modeling and Textual Queries

Image-Specific Information Suppression and Implicit Local Alignment for Text-Based Person Search

CoVR-2: Automatic Data Construction for Composed Video Retrieval

Improving First-stage Retrieval of Point-of-interest Search by Pre-training Models

Sparse graph matching network for temporal language localization in videos

Towards Unsupervised Referring Expression Comprehension with Visual Semantic Parsing

Multi-modal interaction with transformers: bridging robots and human with natural language

WITHDRAWN: Efficient fragmentation and allocation on clustering in distributed environment (FACE)

Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation

PMG—Pyramidal Multi-Granular Matching for Text-Based Person Re-Identification

Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems

Webly-supervised semantic segmentation via curriculum learning

Image Retrieval Through Free-Form Query using Intelligent Text Processing

