Zero-Shot Composed Image Retrieval Considering Query-Target Relationship Leveraging Masked Image-Text Pairs
This paper proposes a novel zero-shot composed image retrieval (CIR) method that considers the query-target relationship by using masked image-text pairs. The objective of CIR is to retrieve the target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word for composing the image and text, and use a pre-trained vision-language model to realize the retrieval. However, they do not consider the query-target relationship when training the textual inversion network, so the network is not optimized to acquire information useful for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting abundant, easily obtained image-text pairs together with a masking strategy for learning the query-target relationship, accurate zero-shot CIR with a retrieval-focused textual inversion network is expected to be realized. Experimental results show the effectiveness of the proposed method.
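A minimal PyTorch sketch of the training idea as we read it: a masked image-text pair supplies the query (masked image plus caption) and the target (the full image), so the textual inversion network is trained directly on the query-target relationship. The encoders, the `phi` inversion network, the masking rule, and the InfoNCE loss are illustrative stand-ins, not the authors' exact design.

```python
# Illustrative sketch: training a textual inversion network with
# masked image-text pairs (stand-in modules, not the paper's design).
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512                                    # shared embedding size (assumed)
img_enc = nn.Linear(2048, D).eval()        # stand-in for a frozen image encoder
txt_enc = nn.Linear(2 * D, D).eval()       # stand-in for a frozen text encoder
for p in list(img_enc.parameters()) + list(txt_enc.parameters()):
    p.requires_grad = False

phi = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))  # textual inversion net

def info_nce(q, t, tau=0.07):
    """Contrastive loss pulling each query toward its own target."""
    logits = F.normalize(q, dim=-1) @ F.normalize(t, dim=-1).T / tau
    return F.cross_entropy(logits, torch.arange(q.size(0)))

imgs = torch.randn(8, 2048)                # raw image features (toy batch)
mask = (torch.rand(8, 2048) > 0.5).float() # random masking of the image -> query
cap  = torch.randn(8, D)                   # caption embedding (toy)

with torch.no_grad():
    target = img_enc(imgs)                 # full image = retrieval target
query_img = img_enc(imgs * mask)           # masked image = query image
pseudo = phi(query_img)                    # pseudo-word embedding from inversion net
composed = txt_enc(torch.cat([pseudo, cap], dim=-1))  # compose pseudo word + text

loss = info_nce(composed, target)          # query-target relationship trained end-to-end
loss.backward()                            # only phi receives gradients
```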
- Book Chapter
- 2
- 10.1007/978-3-030-45439-5_4
- Jan 1, 2020
We address and formalise the task of sequence-to-sequence (seq2seq) cross-modal retrieval. Given a sequence of text passages as a query, the goal is to retrieve a sequence of images that best describes and aligns with the query. This new task extends traditional cross-modal retrieval, where each image-text pair is treated independently, ignoring the broader context. We propose a novel variational recurrent seq2seq (VRSS) retrieval model for this seq2seq task. Unlike most cross-modal methods, we generate an image vector corresponding to the latent topic obtained from combining the text semantics and context. This synthetic image embedding point associated with every text embedding point can then be employed for either image generation or image retrieval as desired. We evaluate the model for the application of stepwise illustration of recipes, where a sequence of relevant images is retrieved to best match the steps described in the text. To this end, we build and release a new Stepwise Recipe dataset for research purposes, containing 10K recipes (sequences of image-text pairs) with a total of 67K image-text pairs. To our knowledge, it is the first publicly available dataset to offer rich semantic descriptions in a focused category such as food or recipes. Our model is shown to outperform several competitive and relevant baselines in the experiments. We also provide a qualitative analysis, through human evaluation and comparison with relevant existing methods, of how semantically meaningful the results produced by our model are.
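A toy sketch of the VRSS idea under stated assumptions: a recurrent state carries the story context, a latent topic is sampled from the combined text semantics and context via the reparameterization trick, and the decoded synthetic image embedding drives per-step retrieval. Layer sizes and the Gaussian head layout are guesses, not the published architecture.

```python
# Toy VRSS-style step: context GRU -> latent topic -> synthetic image query.
import torch
import torch.nn as nn
import torch.nn.functional as F

Dt, Dz, Di = 256, 64, 512                     # text, latent, image dims (assumed)
gru = nn.GRUCell(Dt, Dt)                      # context over the passage sequence
to_mu, to_logvar = nn.Linear(2 * Dt, Dz), nn.Linear(2 * Dt, Dz)
decode = nn.Linear(Dz, Di)                    # latent topic -> image embedding

def step(text_emb, h):
    h = gru(text_emb, h)
    joint = torch.cat([text_emb, h], dim=-1)  # text semantics + context
    mu, logvar = to_mu(joint), to_logvar(joint)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
    return decode(z), h

steps = torch.randn(5, 1, Dt)                 # 5 recipe steps (toy embeddings)
gallery = F.normalize(torch.randn(100, Di), dim=-1)
h = torch.zeros(1, Dt)
for t in range(5):
    img_query, h = step(steps[t], h)
    scores = F.normalize(img_query, dim=-1) @ gallery.T    # retrieve per step
    print("step", t, "best image:", scores.argmax().item())
```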
- Research Article
- 16
- 10.3390/app13010282
- Dec 26, 2022
- Applied Sciences
Remote sensing technology has advanced rapidly in recent years. Because of the deployment of quantitative and qualitative sensors, as well as the evolution of powerful hardware and software platforms, it powers a wide range of civilian and military applications. This in turn leads to the availability of large data volumes suitable for a broad range of applications such as monitoring climate change. Yet, processing, retrieving, and mining large data are challenging. Usually, content-based remote sensing (RS) image retrieval approaches rely on a query image to retrieve relevant images from the dataset. To increase the flexibility of the retrieval experience, cross-modal representations based on text–image pairs are gaining popularity. Indeed, combining the text and image domains is regarded as one of the next frontiers in RS image retrieval. Yet, aligning text to the content of RS images is particularly challenging due to the visual-semantic discrepancy between the language and vision worlds. In this work, we propose different architectures based on vision and language transformers for text-to-image and image-to-text retrieval. Extensive experimental results on four different datasets, namely the TextRS, Merced, Sydney, and RSICD datasets, are reported and discussed.
- Research Article
- 1
- 10.1609/aaai.v39i5.32543
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
The aim of text-based person retrieval is to identify pedestrians using natural language descriptions within a large-scale image gallery. Traditional methods rely heavily on manually annotated image-text pairs, which are resource-intensive to obtain. With the emergence of Large Vision-Language Models (LVLMs), the advanced capabilities of contemporary models in image understanding have led to the generation of highly accurate captions. Therefore, this paper explores the potential of employing Large Vision-Language Models for unsupervised text-based pedestrian image retrieval and proposes a Multi-grained Uncertainty Modeling and Alignment framework (MUMA). Initially, multiple Large Vision-Language Models are employed to generate diverse and hierarchically structured pedestrian descriptions across different styles and granularities. However, the generated captions inevitably introduce noise. To address this issue, an uncertainty-guided sample filtration module is proposed to estimate and filter out unreliable image-text pairs. Additionally, to simulate the diversity of styles and granularities in captions, a multi-grained uncertainty modeling approach is applied to model the distributions of captions, with each caption represented as a multivariate Gaussian distribution. Finally, a multi-level consistency distillation loss is employed to integrate and align the multi-grained captions, aiming to transfer knowledge across different granularities. Experimental evaluations conducted on three widely-used datasets demonstrate the significant advancements achieved by our approach.
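An illustrative sketch of two MUMA ingredients as described above: each generated caption is modeled as a multivariate Gaussian (mean and log-variance heads), and pairs with high estimated uncertainty are filtered out before matching. The heads, the median-based filtering rule, and the scoring are assumptions for illustration only.

```python
# Sketch: captions as Gaussians + uncertainty-guided sample filtration.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256
mu_head, logvar_head = nn.Linear(D, D), nn.Linear(D, D)

cap_feats = torch.randn(16, D)                 # LVLM-generated caption features (toy)
img_feats = F.normalize(torch.randn(16, D), dim=-1)

mu = mu_head(cap_feats)                        # caption as N(mu, diag(sigma^2))
logvar = logvar_head(cap_feats)
uncertainty = logvar.exp().mean(dim=-1)        # scalar uncertainty per caption

keep = uncertainty < uncertainty.median()      # drop the least reliable pairs (toy rule)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # sample a caption embedding
scores = F.normalize(z[keep], dim=-1) @ img_feats[keep].T
print("kept", int(keep.sum()), "pairs; score matrix:", tuple(scores.shape))
```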
- Conference Article
- 3
- 10.1145/3591106.3592234
- Jun 12, 2023
In this study, we propose a text-to-image fashion retrieval method that captures the texture of clothing fabrics. A fabric's texture is a major factor governing the comfort and appearance of clothes and significantly influences user preferences. However, unlike patterns and shapes, which can readily be captured from a global image of the entire piece of clothing, extracting the fine and ambiguous characteristics of textures is considerably more challenging. The key concept is that by focusing on the "local" regions of clothing, detailed fabric textures can be captured more accurately. To this end, we propose a framework for learning cross-modal features from both global (the entire garment) and local (a close-up detail) image-text pairs. To verify the idea, we constructed a new dataset named Global and Local FACAD (G&L FACAD) by modifying the existing large-scale public FACAD dataset used for fashion retrieval. The experimental results confirm that the retrieval accuracy is significantly improved compared to the baselines. The code is available at https://github.com/SuzukiDaichi-git/texture_aware_fashion_retrieval.git.
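A minimal sketch of the global-plus-local training signal described above: one contrastive term over (whole garment, description) pairs and one over (close-up, texture text) pairs, summed. The shared loss form and equal weighting are our assumptions, not necessarily the paper's formulation.

```python
# Sketch: symmetric contrastive loss applied at two granularities.
import torch
import torch.nn.functional as F

def clip_loss(img, txt, tau=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.T / tau
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

B, D = 8, 512
g_img, g_txt = torch.randn(B, D), torch.randn(B, D)   # global pair embeddings (toy)
l_img, l_txt = torch.randn(B, D), torch.randn(B, D)   # local close-up pair embeddings

loss = clip_loss(g_img, g_txt) + clip_loss(l_img, l_txt)  # texture signal from the local term
```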
- Book Chapter
- 1
- 10.1007/978-3-030-67832-6_19
- Jan 1, 2021
Language person search, which means retrieving specific person images from a natural language description, is becoming a research hotspot in the area of person re-identification. Compared with person re-identification, which is an image retrieval task, language person search poses challenges due to the heterogeneous semantic gap between the image and text modalities. To solve this problem, most existing methods employ a softmax-based classification loss to embed the visual and textual features into a common latent space. However, pair-based loss, a successful approach in metric learning, is hardly mentioned for this task. Recently, pair-based weighting losses for deep metric learning have shown great potential in improving the performance of many retrieval-related tasks. In this paper, to better correlate person images with given language descriptions, we introduce a pair-based weighting loss which encourages the model to assign appropriate weights to different image-text pairs. We have conducted extensive experiments on the CUHK-PEDES dataset, and the experimental results validate the effectiveness of our proposed method.
Keywords: Language person search, Deep metric learning, Pair weighting
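As a concrete illustration, here is a pair-based weighting loss in the multi-similarity style applied to image-text cosine similarities, where each pair's gradient weight grows with its difficulty. The hyperparameters and the specific multi-similarity form are illustrative; the paper's exact weighting scheme may differ.

```python
# Sketch: multi-similarity-style pair weighting over image-text pairs.
import torch
import torch.nn.functional as F

def pair_weighting_loss(img, txt, alpha=2.0, beta=50.0, margin=0.5):
    sim = F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).T   # (B, B)
    pos = sim.diag()                                              # matched pairs
    B = sim.size(0)
    neg = sim[~torch.eye(B, dtype=torch.bool)].view(B, B - 1)     # mismatched pairs
    # Soft weighting: harder pairs (low pos / high neg similarity) weigh more.
    pos_term = F.softplus(-alpha * (pos - margin)) / alpha
    neg_term = F.softplus(torch.logsumexp(beta * (neg - margin), dim=1)) / beta
    return (pos_term + neg_term).mean()

img, txt = torch.randn(8, 256), torch.randn(8, 256)
print(pair_weighting_loss(img, txt))
```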
- Conference Article
- 233
- 10.1109/iccv.2011.6126524
- Nov 1, 2011
Many applications involve multiple modalities, such as text and images, that describe the problem of interest. In order to leverage the information present in all the modalities, one must model the relationships between them. While some techniques have been proposed to tackle this problem, they are either restricted to words describing visual objects only, or require full correspondences between the different modalities. As a consequence, they are unable to tackle more realistic scenarios where a narrative text is only loosely related to an image, and where only a few image-text pairs are available. In this paper, we propose a model that addresses both these challenges. Our model can be seen as a Markov random field of topic models, which connects the documents based on their similarity. As a consequence, the topics learned with our model are shared across connected documents, thus encoding the relations between different modalities. We demonstrate the effectiveness of our model for image retrieval from loosely related text.
- Research Article
- 1
- 10.1371/journal.pone.0310098
- Sep 9, 2024
- PloS one
Conditional image retrieval (CIR), which involves retrieving images by a query image along with user-specified conditions, is essential in computer vision research for efficient image search and automated image analysis. Existing approaches, such as composed image retrieval (CoIR) methods, have been actively studied. However, these methods face challenges as they require either a triplet dataset or richly annotated image-text pairs, which are expensive to obtain. In this work, we demonstrate that CIR at the image-level concept can be achieved using an inverse mapping approach that explores the model's inductive knowledge. Our proposed CIR method, called Backward Search, updates the query embedding to conform to the condition. Specifically, the embedding of the query image is updated by predicting the probability of the label and minimizing the difference from the condition label. This enables CIR with image-level concepts while preserving the context of the query. In this paper, we introduce the Backward Search method, which enables single- and multi-conditional image retrieval. Moreover, we efficiently reduce the computation time by distilling the knowledge. We conduct experiments using the WikiArt, aPY, and CUB benchmark datasets. The proposed method achieves an average mAP@10 of 0.541 on the datasets, demonstrating a marked improvement compared to the CoIR methods in our comparative experiments. Furthermore, by employing knowledge distillation with the Backward Search model as the teacher, the student model achieves a significant reduction in computation time, up to 160 times faster, with only a slight decrease in performance. The implementation of our method is available at the following URL: https://github.com/dhlee-work/BackwardSearch.
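A conceptual sketch of the Backward Search step as described: the query embedding is updated by gradient descent so that the classifier's probability of the condition label rises while the embedding stays near the original query, preserving its context. The linear classifier, step size, and proximity regularizer are illustrative stand-ins.

```python
# Sketch: update the query embedding to conform to a condition label.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, n_classes = 256, 10
classifier = nn.Linear(D, n_classes)            # stand-in for the model's label head

query = torch.randn(1, D)                       # embedding of the query image
z = query.clone().requires_grad_(True)
cond = torch.tensor([3])                        # user-specified condition label
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(classifier(z), cond) \
         + 0.1 * (z - query).pow(2).sum()       # stay near the query (assumed regularizer)
    loss.backward()
    opt.step()

# z is now the condition-conforming query; retrieval is nearest-neighbor on it.
gallery = F.normalize(torch.randn(1000, D), dim=-1)
top10 = (F.normalize(z.detach(), dim=-1) @ gallery.T).topk(10).indices
```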
- Book Chapter
- 3
- 10.1007/978-3-642-15246-7_11
- Jan 1, 2010
Humans can associate the vision and language modalities and thus generate mental imagery, i.e. visual images, from linguistic input in an environment of unlimited inflowing information. Inspired by human memory, we separate a text-to-image retrieval task into two steps: 1) text-to-image conversion (generating visual queries for the second step) and 2) image-to-image retrieval. This separation is advantageous for visualizing inner representations, learning from an incrementally growing dataset, and reusing the results of content-based image retrieval. Here, we propose a visual query expansion method that simulates the capability of human associative memory. We use a hypernetwork model (HN) that combines visual words and linguistic words. HNs learn higher-order cross-modal associative relationships incrementally on a sequence of image-text pairs. An incremental HN generates images by assembling visual words based on linguistic cues, and we retrieve similar images with the generated visual query. The method is evaluated on 26 video clips of ‘Thomas and Friends’. Experiments show a successive image retrieval rate of up to 98.1% with a single text cue, and the method shows additional potential for generating the visual query from several text cues simultaneously.
Keywords: Visual Word, Image Patch, Mental Imagery, Incremental Learning, Retrieval Task
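A toy numpy sketch of the two-step pipeline under loose assumptions: linguistic cues select associated visual words from a cross-modal association table (standing in for the learned hypernetwork), the selected visual words are assembled into a visual query histogram, and retrieval is nearest-neighbor over image bag-of-visual-words histograms. Note that several text cues can be combined at once.

```python
# Toy two-step retrieval: text cues -> visual query -> image-to-image search.
import numpy as np

rng = np.random.default_rng(0)
V_TXT, V_VIS, N_IMG = 50, 100, 200
cooc = rng.random((V_TXT, V_VIS))             # word <-> visual-word association strengths
images = rng.random((N_IMG, V_VIS))           # bag-of-visual-words histograms
images /= images.sum(axis=1, keepdims=True)

def text_to_visual_query(word_ids):
    """Step 1: assemble a visual query from one or more text cues."""
    q = cooc[word_ids].sum(axis=0)            # combine each cue's visual-word evidence
    return q / q.sum()

def retrieve(q, k=5):
    """Step 2: image-to-image retrieval with the generated visual query."""
    scores = images @ q                       # dot-product similarity (toy choice)
    return np.argsort(scores)[::-1][:k]

print(retrieve(text_to_visual_query([3, 17])))   # two text cues used simultaneously
```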
- Research Article
- 18
- 10.1016/j.patcog.2014.07.001
- Jul 9, 2014
- Pattern Recognition
Automatic image–text alignment for large-scale web image indexing and retrieval
- Research Article
- 3
- 10.54097/ak59r592
- Jan 20, 2024
- Academic Journal of Science and Technology
Vision-language pre-training (VLP) has made significant progress in improving performance on multiple vision-language tasks. However, most current pre-trained models are either good at understanding tasks or focused on generation tasks. Furthermore, performance improvements often rely primarily on scaling up datasets of noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we build on BLIP, a VLP framework that can be flexibly applied to vision-language understanding and generation tasks. BLIP effectively utilizes noisy web data by bootstrapping captions: a captioner generates synthetic captions, and a filter removes the noisy ones. To meet the practical needs of search engines for both retrieval speed and retrieval accuracy, this paper proposes an improved method based on the BLIP algorithm. We migrate BLIP's image-text retrieval strategy from ITC (image-text contrastive) scoring to ITM (image-text matching) scoring, and improve the model's ability to discriminate positive from negative samples with a hard-sample strategy, further improving retrieval accuracy.
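A small sketch of the retrieval-strategy change described above: cheap ITC (dual-encoder cosine) scoring proposes the top-k candidates, and an ITM (image-text matching) head re-scores only those hard candidates. The ITM head here is a random stand-in; in BLIP it is a classifier over cross-attended image-text features.

```python
# Sketch: ITC candidate generation followed by ITM re-ranking.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N, K = 256, 1000, 16
itm_head = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, 1))

txt = F.normalize(torch.randn(1, D), dim=-1)          # query text embedding (toy)
gallery = F.normalize(torch.randn(N, D), dim=-1)      # image embeddings (toy)

itc_scores = (txt @ gallery.T).squeeze(0)             # cheap first stage over all N
cand = itc_scores.topk(K).indices                     # top-k = the hard candidates

pairs = torch.cat([txt.expand(K, D), gallery[cand]], dim=-1)
itm_scores = itm_head(pairs).squeeze(-1)              # expensive stage on K pairs only
ranking = cand[itm_scores.argsort(descending=True)]
print("final top-5 image ids:", ranking[:5].tolist())
```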
- Research Article
- 7
- 10.1016/j.inffus.2024.102535
- Jun 26, 2024
- Information Fusion
Vision-language joint representation learning for sketch less facial image retrieval
- Book Chapter
- 37
- 10.1007/978-3-031-19833-5_37
- Jan 1, 2022
Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven to be effective in boosting various downstream V+L tasks. However, when it comes to the fashion domain, existing V+L methods are inadequate as they overlook the unique characteristics of both fashion V+L data and downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed particularly to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains where a V+L datum contains only a single image-text pair, in the fashion domain a datum could contain multiple images. We thus propose a Multi-View Contrastive Learning task for pulling closer the visual representation of one image to the compositional multimodal representation of another image+text. Second, fashion text (e.g., product descriptions) often contains rich fine-grained concepts (attributes/noun phrases). To capitalize on this, a Pseudo-Attributes Classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to be adjacent. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture consisting of a modality-agnostic Transformer so that it can be flexibly adapted to any downstream tasks. Extensive experiments show that our FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
Keywords: Vision and Language, Representation learning, Fashion
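A minimal sketch of the Multi-View Contrastive Learning task as we read it: for a product with two image views, view A's visual embedding is pulled toward the fused (view B + text) multimodal embedding and contrasted against the other products in the batch. The fusion module and temperature are assumed stand-ins.

```python
# Sketch: contrast one view against the other view's multimodal composition.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D = 8, 512
fuse = nn.Linear(2 * D, D)                       # stand-in for the fusion transformer

view_a = torch.randn(B, D)                       # visual embedding, image view A
view_b = torch.randn(B, D)                       # visual embedding, image view B
text   = torch.randn(B, D)                       # product description embedding

multimodal = fuse(torch.cat([view_b, text], dim=-1))
logits = F.normalize(view_a, dim=-1) @ F.normalize(multimodal, dim=-1).T / 0.07
loss = F.cross_entropy(logits, torch.arange(B))  # match view A to its own (view B + text)
```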
- Research Article
- 5
- 10.1109/access.2023.3239858
- Jan 1, 2023
- IEEE Access
A cross-modal image retrieval method that explicitly considers semantic relationships between images and texts is proposed. Most conventional cross-modal image retrieval methods retrieve the target images by directly measuring the similarities between the candidate images and query texts in a common semantic embedding space. However, such methods tend to focus on a one-to-one correspondence between a predefined image-text pair during the training phase, and other semantically similar images and texts are ignored. By considering the many-to-many correspondences between semantically similar images and texts, a common embedding space is constructed that preserves semantic relationships, which allows users to accurately find more images that are related to the input query texts. Thus, in this paper, we propose a cross-modal image retrieval method that considers semantic relationships between images and texts. The proposed method calculates the similarities between texts as semantic similarities to acquire the relationships. Then, we introduce a loss function that explicitly constructs the many-to-many correspondences between semantically similar images and texts from their semantic relationships. We also propose an evaluation metric to assess whether each method can construct an embedding space considering the semantic relationships. Experimental results demonstrate that the proposed method outperforms conventional methods in terms of this newly proposed metric.
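One way to realize the many-to-many correspondence described above, sketched under our own reading: text-text semantic similarities become soft targets for the image-text logits, so semantically similar captions (and their images) are no longer pushed apart as hard negatives. The soft-target construction is illustrative, not necessarily the paper's exact loss.

```python
# Sketch: text-text similarities as soft targets for image-text matching.
import torch
import torch.nn.functional as F

B, D = 8, 256
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)

with torch.no_grad():
    txt_sim = txt @ txt.T                          # semantic similarity between texts
    soft_targets = F.softmax(txt_sim / 0.1, dim=1) # many-to-many supervision signal

logits = img @ txt.T / 0.07
loss = F.kl_div(F.log_softmax(logits, dim=1), soft_targets, reduction="batchmean")
```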
- Conference Article
- 10.24132/csrn.3401.20
- Jan 1, 2024
- Computer Science Research Notes
In this study, we propose a method for automatically generating high-quality CLIP training data to enhance the performance of text-based image retrieval with CLIP. CLIP training utilizes two types of image-text pair data: compatible (correct) pairs and incompatible (incorrect) pairs. Correct pairs, where the image and text content match, are traditionally created by scraping or similar methods, while incorrect pairs are formed by recombining elements of correct pairs. CLIP is trained contrastively to increase the similarity within correct pairs and decrease it within incorrect pairs. However, when the training data contains multiple mutually similar images, the texts attached to them are also likely to be similar; recombined pairs built from such images should preferably be treated as correct pairs, yet naive recombination treats them as incorrect. Conversely, if two images taken from the training data are not similar, the similarity between the texts assigned to them should also be low, so a highly reliable incorrect pair can be created by exchanging their assigned texts. We applied this idea to the results of clustering the images and texts in the training data, respectively, using the similarity between clusters to generate incorrect pairs, and trained the model so that the negative effect grows as the similarity between images decreases. The results of an experiment using the Amazon review dataset, which is commonly used in this field, showed a 27.0% improvement in Rank@1 score compared to vanilla CLIP.
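A toy sketch of the pair-generation rule described above: cluster the image embeddings (the paper clusters the texts as well), form incorrect pairs only by swapping texts between sufficiently dissimilar clusters, and weight each negative more strongly the less similar the source images are. The clustering backend, threshold, and weighting are illustrative choices.

```python
# Toy generation of reliable incorrect pairs via cluster dissimilarity.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(200, 64))         # image embeddings of correct pairs (toy)
txt_idx = np.arange(200)                     # text i belongs to image i

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(img_emb)
centers = km.cluster_centers_ / np.linalg.norm(km.cluster_centers_, axis=1, keepdims=True)
cluster_sim = centers @ centers.T            # similarity between image clusters

negatives = []
for i, j in zip(rng.permutation(200), rng.permutation(200)):
    ci, cj = km.labels_[i], km.labels_[j]
    if ci != cj and cluster_sim[ci, cj] < 0.2:           # dissimilar clusters only
        weight = 1.0 - cluster_sim[ci, cj]               # lower similarity -> stronger negative
        negatives.append((i, txt_idx[j], weight))        # image i paired with swapped text j

print(len(negatives), "reliable incorrect pairs generated")
```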