RAGCap: retrieval-augmented generation for style-aware remote sensing image captioning without fine-tuning
ABSTRACT Recently, generalist vision-language models (VLMs) have proven to be highly versatile across various tasks, making them indispensable in computer vision and natural language processing. Their broad pre-training enables robust performance across many domains. When it comes to tailoring these models to specific downstream tasks such as remote sensing (RS) image captioning, the traditional approach has been fine-tuning. Fine-tuning, however, is often computationally expensive and may compromise the model's inherent generalization capabilities by over-specializing on limited domain-specific datasets. This paper investigates Retrieval-Augmented Generation (RAG) as an alternative strategy for adapting generalist VLMs to RS captioning without requiring fine-tuning. We introduce RAGCap, a retrieval-augmented framework that leverages similarity-based retrieval to select relevant image-caption pairs from the training dataset. These examples are then combined with the target image within a carefully designed prompt structure, guiding the generalist VLM to generate RS captions that are stylistically consistent with the training dataset. While our implementation utilizes SigLIP for retrieval and Qwen2VL as the base VLM, the proposed framework is model-agnostic and applicable to other models. Extensive evaluations on four RS benchmark datasets reveal that RAGCap achieves competitive performance compared to traditional fine-tuning approaches. Our findings suggest that RAG methods like RAGCap offer a scalable, practical alternative to fine-tuning for domain adaptation in RS image captioning. Code will be available at: https://github.com/BigData-KSU/RAGCap.
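The abstract describes the RAGCap pipeline only at a high level. The minimal Python sketch below illustrates the general idea: embed the training images with SigLIP, retrieve the most similar image-caption pairs for a target image by cosine similarity, and fold the retrieved captions into a few-shot prompt for a frozen generalist VLM. The checkpoint name, prompt wording, and variables such as `train_paths`, `train_captions`, and `target_path` are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a RAGCap-style pipeline (illustrative assumptions,
# not the authors' exact implementation).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# 1) Embed the training images once with SigLIP and keep them as a retrieval index.
siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def embed(images):
    inputs = siglip_proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = siglip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# train_paths / train_captions / target_path are assumed to be given.
train_images = [Image.open(p) for p in train_paths]
index = embed(train_images)                       # (N, d) unit-norm vectors

# 2) Retrieve the k most similar training pairs for the target image (cosine similarity).
query = embed([Image.open(target_path)])
topk = (query @ index.T).squeeze(0).topk(k=3).indices.tolist()

# 3) Assemble a few-shot prompt: the retrieved captions set the expected style,
#    and the target image is passed alongside it to the frozen generalist VLM.
examples = "\n".join(f"Example caption: {train_captions[i]}" for i in topk)
prompt = (
    "You describe remote sensing images in the style of the examples below.\n"
    f"{examples}\n"
    "Now caption the following image in the same style."
)
# The prompt and target image are then fed to the VLM's chat interface
# (e.g., Qwen2VL) to generate the caption; no model weights are updated.
```

Because the VLM stays frozen, adapting to a new RS dataset under this scheme only requires recomputing the retrieval index over its training captions.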
- Research Article
47
- 10.1145/3009906
- Dec 12, 2016
- ACM Computing Surveys
Integrating computer vision and natural language processing is a novel interdisciplinary field that has received considerable attention recently. In this survey, we provide a comprehensive introduction to the integration of computer vision and natural language processing in multimedia and robotics applications, with more than 200 key references. The tasks that we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies for integrating computer vision and natural language processing models under the unified theme of distributional semantics, drawing an analogy between image embeddings in computer vision and word embeddings in natural language processing. We also present a unified view of the field and propose possible future directions.
- Research Article
37
- 10.1109/tcyb.2022.3222606
- Nov 1, 2023
- IEEE Transactions on Cybernetics
Remote sensing image captioning (RSIC), which describes a remote sensing image with a semantically related sentence, has been a cross-modal challenge between computer vision and natural language processing. For visual features extracted from remote sensing images, global features provide the complete and comprehensive visual relevance of all the words of a sentence simultaneously, while local features can emphasize the discrimination of these words individually. Therefore, not only are global features important for caption generation, but local features are also meaningful for making the words more discriminative. In order to make full use of the advantages of both global and local features, in this article we propose an attention-based global-local captioning model (GLCM) to obtain global-local visual feature representations for RSIC. Based on the proposed GLCM, the correlation of all the generated words and the relation of each separate word to the most related local visual features can be visualized in a similarity-based manner, which provides more interpretability for RSIC. In extensive experiments, our method achieves comparable results on UCM-captions and superior results on Sydney-captions and RSICD, which is the largest RSIC dataset.
- Conference Article
5
- 10.1109/icaiic57133.2023.10067039
- Feb 20, 2023
Image captioning aspires to achieve a description of images with machines as a combination of the Computer Vision (CV) and Natural Language Processing (NLP) fields. The current state of the art for image captioning uses the attention-based encoder-decoder model. The attention-based model uses an ‘attention mechanism’ that focuses on a particular section of the image to generate its corresponding caption word. The NLP side of this model uses Long Short-Term Memory (LSTM) for word generation. Attention-based models did not emphasize the relative arrangement of words in a caption, thereby ignoring the context of the sentence. Inspired by the versatility of Transformers in NLP, this work tries to utilise their architectural features for the image captioning use case. This work also makes use of a pretrained Bidirectional Encoder Representations from Transformers (BERT) model, which generates a contextually rich embedding of a caption. The Multi-Head Attention of the Transformer establishes a strong correlation between the image and the contextually aware caption. The experiment is performed on the Remote Sensing Image Captioning Dataset. The results of the model are evaluated using NLP evaluation metrics such as Bilingual Evaluation Understudy 1–4 (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE). The proposed model shows better results for a few of the metrics.
- Research Article
43
- 10.1109/tgrs.2021.3102590
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
Remote sensing image captioning has attracted widespread attention in the remote sensing field due to its application potential. However, most existing approaches model limited interactions between image content and sentence and fail to exploit the special characteristics of remote sensing images. We introduce a novel recurrent attention and semantic gate (RASG) framework to facilitate remote sensing image captioning in this article, which integrates competitive visual features and a recurrent attention mechanism to generate a better context vector for the image at each step, as well as enhancing the representation of the current word state. Specifically, we first project each image into competitive visual features by taking advantage of both static visual features and multiscale features. Then, a novel recurrent attention mechanism is developed to extract high-level attentive maps from encoded features and nonvisual features, which can help the decoder recognize and focus on the effective information for understanding the complex content of remote sensing images. Finally, the hidden states from the long short-term memory (LSTM) and other semantic references are incorporated into a semantic gate, which contributes to more comprehensive and precise semantic understanding. Comprehensive experiments on three widely used datasets, Sydney-Captions, UCM-Captions, and the Remote Sensing Image Captioning Dataset, have demonstrated the superiority of the proposed RASG over a series of attention-based image captioning methods.
- Research Article
114
- 10.3390/rs11060612
- Mar 13, 2019
- Remote Sensing
Image captioning generates a semantic description of an image. It deals with image understanding and text mining, which have made great progress in recent years. However, it is still a great challenge to bridge the “semantic gap” between low-level features and high-level semantics in remote sensing images, in spite of the improvement in image resolutions. In this paper, we present a new model with an attribute attention mechanism for the description generation of remote sensing images, and we explore the impact of the attributes extracted from remote sensing images on the attention mechanism. The results of our experiments demonstrate the validity of the proposed model. The proposed method obtains six higher scores and one slightly lower score, compared against several state-of-the-art techniques, on the Sydney Dataset and the Remote Sensing Image Caption Dataset (RSICD), and receives all seven higher scores on the UCM Dataset for remote sensing image captioning, indicating that the proposed framework achieves robust performance for semantic description of high-resolution remote sensing images.
- Research Article
45
- 10.1109/access.2019.2962195
- Jan 1, 2020
- IEEE Access
Remote sensing image captioning, which aims to understand high-level semantic information and the interactions of different ground objects, is an emerging research topic. Though image captioning has developed rapidly with convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the image captioning task for remote sensing images still suffers from two main limitations. One limitation is that the scales of objects in remote sensing images vary dramatically, which makes it difficult to obtain an effective image representation. Another limitation is that the visual relationships in remote sensing images are still underused, despite their great potential to improve the final performance. In order to deal with these two limitations, an effective framework for remote sensing image captioning is proposed in this paper. The framework is based on multi-level attention and multi-label attribute graph convolution. Specifically, the proposed multi-level attention module can adaptively focus not only on specific spatial features, but also on features of specific scales. Moreover, the designed attribute graph convolution module can employ the attribute graph to learn more effective attribute features for image captioning. Extensive experiments are conducted, and the proposed method achieves superior performance on the UCM-captions, Sydney-captions, and RSICD datasets.
- Research Article
- 10.32377/cvrjst2614
- Jun 1, 2024
- CVR Journal of Science and Technology
Image captioning is an interdisciplinary area that uses techniques from computer vision and natural language processing to provide a textual description of a picture. The image captioning task is the process of understanding the scene in an image by identifying the objects and associated actions present, in order to create a meaningful human-like caption that can be used for a wide range of applications, including image retrieval, video indexing, assistive technology for the visually impaired, content-based image search, biomedicine, and autonomous cars. Formerly, machine learning was used for this purpose, relying extensively on hand-crafted features such as the Scale-Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), the Histogram of Oriented Gradients (HOG), and combinations of these features. Extracting handmade features from huge datasets is neither straightforward nor viable. Many deep learning-based techniques were later proposed. Deep learning retrieval- and template-based approaches were presented; however, both had drawbacks such as losing crucial objects. Recent breakthroughs in deep learning and natural language processing have resulted in considerable increases in image captioning system performance, involving attention mechanisms, transformer-based architectures, multimodal connections, object-detection-based encoder-decoders, and many others. This survey explores some of the most recent techniques for image captioning, along with the datasets and evaluation measures that have been employed in deep learning-based automatic image captioning. The ultimate intention of this study is to act as a guide for researchers by emphasizing future directions for research work. Index Terms: image captioning, computer vision, deep learning, textual description, natural language processing.
- Research Article
- 10.3390/rs17071237
- Mar 31, 2025
- Remote Sensing
Remote sensing image captioning (RSIC) aims to describe ground objects and scenes within remote sensing images in natural language form. As the complexity and diversity of scenes in remote sensing images increase, existing methods, although effective in specific tasks, are largely trained on particular scene images and corpora. This limits their ability to generate descriptions for scenes not encountered during training. Given the finite resources for data annotation and the expanding range of application scenarios, training data typically cover only a subset of common scenes, leaving many potential scene types unrepresented. Consequently, developing models capable of effectively handling unseen scenes with limited training data is imperative. This study introduces an innovative remote sensing image captioning model based on scene attribute learning (SALCap). The proposed model defines scene attributes and employs a specifically designed global object scene attribute extractor to capture these attributes. It then uses an attribute inference module to predict scene information from scene attributes, ensuring that this part of the scene's information is reused in sentence generation through an additional attribute loss. Experiments show that the method not only improves the accuracy of the descriptions but also significantly enhances the model's adaptability and generalizability to unseen scenes. This advancement expands the practical utility of remote sensing image captioning across diverse scenarios, particularly under the constraints of limited annotations.
- Book Chapter
10
- 10.1007/978-981-10-5209-5_10
- Jan 1, 2018
Natural language generation from images, also referred to as image or visual captioning, is an emerging deep learning application at the intersection of computer vision and natural language processing. Image captioning also forms the technical foundation for many practical applications. Advances in deep learning technologies have created significant progress in this area in recent years. In this chapter, we review the key developments in image captioning and their impact on both research and industry deployment. Two major schemes developed for image captioning, both based on deep learning, are presented in detail. A number of examples of natural language descriptions of images produced by two state-of-the-art captioning systems are provided to illustrate the high quality of the systems' outputs. Finally, recent research on generating stylistic natural language from images is reviewed.
- Research Article
224
- 10.1109/tpami.2022.3148210
- Jan 1, 2023
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
- Research Article
7
- 10.1145/3483597
- Dec 13, 2021
- ACM Transactions on Asian and Low-Resource Language Information Processing
Image captioning refers to the process of generating a textual description that describes the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most of the work has been carried out for image captioning in the English language. This article presents a novel method for image captioning in the Hindi language using an encoder-decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit for developing an image captioning model in the Hindi language. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on an image's important channels while performing the convolution, essentially assigning higher importance to specific channels over others, and has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder-decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi is the fourth most spoken language globally, widely spoken in India and South Asia; it is India's official language. By translating the well-known MSCOCO dataset from English to Hindi, a dataset for image captioning in Hindi was manually created. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results obtained illustrate that the proposed method outperforms the other baselines. The proposed method attains improvements of 0.59%, 2.51%, 4.38%, and 3.30% in terms of BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, with respect to the state of the art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the proposed method's efficacy.
- Conference Article
1
- 10.23919/fruct52173.2021.9435534
- May 12, 2021
Recent advances in deep learning have enabled machines to see, hear, and even speak. In some cases, with the help of deep learning, machines have also outperformed humans in these complex tasks. Such improvements have reignited interest in many fields. Image captioning, which is considered an intersection between computer vision and natural language processing, has recently received significant attention. Deep learning-based image captioning models represent a great improvement on traditional methods. However, most of the work done in image captioning is based on supervised deep learning methods. Recently, unsupervised image captioning has started to gather momentum. This paper presents the first survey that focuses on unsupervised and semi-supervised image captioning techniques and methods. Additionally, the survey shows how such methods can be used with different data availability and data pairing settings, where some methods can be used with paired data, while others can be used with unpaired data. Furthermore, special cases of unpaired data, such as cross-domain and cross-lingual image captioning, are also discussed. Finally, the survey presents a discussion on challenges and future research directions of image captioning.
- Research Article
42
- 10.1007/s11042-020-09294-7
- Jul 17, 2020
- Multimedia Tools and Applications
Image captioning is the task of generating a natural semantic description of a given image, which plays an essential role in enabling machines to understand image content. Remote sensing image captioning is a part of this field. Most current remote sensing image captioning models fail to fully utilize the semantic information in images and suffer from the overfitting problem induced by the small size of the datasets. To this end, we propose a new model using the Transformer to decode the image features into target sentences. To make the Transformer more adaptive to the remote sensing image captioning task, we additionally employ dropout layers, residual connections, and adaptive feature fusion in the Transformer. Reinforcement learning is then applied to enhance the quality of the generated sentences. We demonstrate the validity of our proposed model on three remote sensing image captioning datasets. Our model obtains all seven higher scores on the Sydney Dataset and the Remote Sensing Image Caption Dataset (RSICD), and four higher scores on the UCM dataset, which indicates that the proposed method performs better than previous state-of-the-art models in remote sensing image caption generation.
- Research Article
42
- 10.3390/rs12060939
- Mar 13, 2020
- Remote Sensing
The task of image captioning involves the generation of a sentence that can describe an image appropriately, lying at the intersection of computer vision and natural language processing. Although research on remote sensing image captioning has just started, it has great significance. The attention mechanism is inspired by the way humans think and is widely used in remote sensing image captioning tasks. However, the attention mechanism currently used in this task is mainly aimed at images, which is too simple to express such a complex task well. Therefore, in this paper, we propose a multi-level attention model, which is a closer imitation of human attention mechanisms. This model contains three attention structures, which represent attention to different areas of the image, attention to different words, and attention to vision and semantics. Experiments show that our model achieves better results than previous methods and is currently state of the art. In addition, the existing datasets for remote sensing image captioning contain a large number of errors. Therefore, in this paper, considerable work has been done to modify the existing datasets in order to promote research on remote sensing image captioning.
- Research Article
4
- 10.1145/3645029
- Mar 9, 2024
- ACM Transactions on Asian and Low-Resource Language Information Processing
This work proposes an efficient summary caption technique that takes the multimodal summary and image captions as input and retrieves, via the captions, the images that are most relevant to the multimodal summary. Matching a multimodal summary with an appropriate image is a challenging task in computer vision and natural language processing. Merging these fields is tedious, though the research community has steadily focused on cross-modal retrieval. Related problems include visual question answering, matching queries with images, and matching semantic relationships between two modalities to retrieve the corresponding image. Relevant works consider questions to match the relationship of visual information and object detection, match text with visual information, and employ structural-level representations to align images with text. However, these techniques primarily focus on retrieving images for text or for image captioning, and less effort has been spent on retrieving relevant images for a multimodal summary. Hence, our proposed technique extracts and merges features in a hybrid image-text layer and embeds the captions in semantic embeddings with word2vec, where contextual features and semantic relationships are compared and matched between the modalities using cosine semantic similarity. In cross-modal retrieval, we obtain the top five related images and align with the multimodal summary the image that achieves the highest cosine score among those retrieved. The model is trained as a sequence-to-sequence model for 100 epochs, reducing information loss with sparse categorical cross-entropy. Further, experiments on the multimodal summarization with multimodal output dataset evaluate the quality of image alignment with an image-precision metric and demonstrate the best results.
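As a rough, hypothetical illustration of the cosine-similarity retrieval step described in the abstract above (not the authors' implementation), selecting the top five images whose pre-computed caption embeddings best match a summary embedding could be sketched as follows:

```python
# Hypothetical sketch: top-5 cross-modal retrieval by cosine similarity.
# summary_vec and caption_vecs are assumed pre-computed embeddings
# (e.g., word2vec-averaged text vectors); all names are illustrative only.
import numpy as np

def top5_by_cosine(summary_vec: np.ndarray, caption_vecs: np.ndarray) -> list:
    """Return indices of the 5 caption embeddings closest to the summary embedding."""
    s = summary_vec / np.linalg.norm(summary_vec)
    c = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    scores = c @ s                        # cosine similarity for each caption
    return np.argsort(-scores)[:5].tolist()
```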