Articles published on Image Captioning
1386 Search results
- New
- Research Article
- 10.1016/j.eswa.2026.131248
- May 1, 2026
- Expert Systems with Applications
- Weiwei Xiang + 5 more
ImCapDA: Fine-tuning CLIP via image captions for unsupervised domain adaptation
- New
- Research Article
- 10.1142/s0219843626500076
- Apr 22, 2026
- International Journal of Humanoid Robotics
- P Girija + 1 more
Image captioning aims to automatically produce relevant and descriptive text for a given image, integrating Natural Language Processing (NLP) and Computer Vision (CV) to understand visual content and express it in words. Existing image captioning methods struggle to generate accurate and contextually rich captions, so their output often lacks descriptive quality and alignment with the visual content. The objective of this study is to develop an efficient image captioning framework capable of producing accurate and semantically rich captions from images. In this research, a Generative Adversarial Image Captioning model based on a hybrid Attention-reinforced transformer with contrastive learning, Serval-Frigatebird Optimization, and Gaussian Error Linear Unit-Long Short-Term Memory (ArCO-SerFO-GLSTM) is introduced. The proposed model consists of the ArCO-SerFO generator, a Reinforcement Learning (RL) generator with a language evaluator, and a discriminator. First, in the ArCO-SerFO generator, the input image is passed through an image encoder to extract visual features, which are then fed to the caption decoder to generate a sample caption. The generated caption is compared with the ground-truth caption using a contrastive loss, which improves the alignment between image features and the caption. The ArCO model is tuned using Serval-Frigatebird Optimization (SerFO). The system then uses an RL generator, in which an image encoder and a multi-attention mechanism guide a language decoder to generate refined captions. These captions are scored by a language evaluator, and an RL loss updates the model based on the reward metrics. Finally, both generated and ground-truth captions are fed into a GELU-LSTM discriminator, which distinguishes real captions from generated ones. The GELU-LSTM is built by incorporating a GELU activation into an LSTM.
The developed ArCO-SerFO-GLSTM achieved a Recall-Oriented Understudy for Gisting Evaluation-L (ROUGE-L) score of 60.19%, Mean Average Precision (mAP) of 80.13%, Bilingual Evaluation Understudy (BLEU) of 84.23%, Metric for Evaluation of Translation with Explicit Ordering (METEOR) of 31.99%, Semantic Propositional Image Caption Evaluation (SPICE) of 25.99%, and Consensus-based Image Description Evaluation (CIDEr) of 123.3 on the Flickr Image dataset.
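The abstract above does not say which contrastive loss is used to align image features with captions. As an illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a batch of paired embeddings (the formulation popularized by CLIP); the function names and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_feats, caption_feats, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form): pair i of images and captions is the positive."""
    imgs = [normalize(v) for v in image_feats]
    caps = [normalize(v) for v in caption_feats]
    # Cosine similarity between every image and every caption, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, c)) / temperature for c in caps]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

With this form, the loss is small when each image embedding is closest to its own caption embedding and large when the pairs are mismatched.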
- New
- Research Article
- 10.1080/14702029.2026.2646453
- Apr 21, 2026
- Journal of Visual Art Practice
- Jane Birkin
ABSTRACT This article does not attempt to set down guidelines to visual essay writing – this would be impossible due to the complexities of the form. Instead, it provides observations on some of the ways that the visual essay can support different ways of thinking with text and image – altering perspectives, creating arguments and building narratives through the sequencing of images and the spatial relationships between image and text – as well as considering the role of image captions. At the same time, and not unconnected, it acknowledges the difficulties inherent in working with non-standard layouts within the parameters of academic publishing. By providing an exchange of ideas (both abstract and practical) that may seem marginal or even contradictory, it is hoped that looking at different aspects of this unique form will help to identify some of the paths that a visual essay might take and that writers might be encouraged to develop their essays in ambitious and experimental ways, considering the diversity of image behaviours and how they interact with text forms. In the longer term, the visual essay form itself and the dialogue around it might be advanced.
- New
- Research Article
- 10.1080/13682199.2026.2658400
- Apr 17, 2026
- The Imaging Science Journal
- Ganesh Khekare + 4 more
ABSTRACT Low visibility, color distortion, and structural complexity are some of the harsh challenges that marine environments pose. This research presents a transformational image captioning framework designed specifically for the assessment of underwater scenes. To produce precise and meaningful captions for underwater images, the proposed technique combines visual-linguistic fusion, contextual and semantic enhancements, and attention mechanisms. The method incorporates distinct local linguistic and spatial capabilities, enabling more precise interpretation and context identification in complex scenes. The model achieves an accuracy of 91.40%, substantially outperforming existing techniques. It is evaluated with established metrics such as Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR) and Consensus-based Image Description Evaluation (CIDEr). Comparative analysis and case studies show that the system can create captions that are both aesthetically pleasing and linguistically rich, making it a valuable tool for tracking, exploration, and marine documentation.
- Research Article
- 10.1145/3796710
- Apr 11, 2026
- ACM Transactions on Multimedia Computing, Communications, and Applications
- Chuanle Song + 4 more
Image captioning is a cross-modal text generation task aimed at understanding the relationships among various objects in an image. Accurately expressing object–object relations remains a key bottleneck for transformer-based image captioning. Prior methods usually inject semantic and geometric relations once and keep them fixed while only updating visual features. This creates a mismatch (evolving visuals vs. frozen relations) that weakens relational guidance and leads to feature entanglement. We propose the Relationship-Experts Transformer (RET), which treats semantic and geometric relations as learnable experts that guide object visual features (students) and co-evolve with them. In RET, we first design the Relationship-Guided Feature Aggregation (RGFA) module, which is analogous to expert-guided student learning: it uses the relationship kernel (the expert’s knowledge brain) to guide the learning of the object visual features (students). Second, we develop the Experts Knowledge Updating (EKU) module, which continuously iterates expert knowledge during training to strengthen the expert’s guidance of the student. Finally, we design the Student Knowledge Selector (SKS) module to adaptively select object visual features enhanced with different relations under the guidance of semantic and geometric experts, generating descriptive texts that embody semantic and geometric knowledge. Experiments on the MSCOCO dataset demonstrate that our model achieves state-of-the-art performance. The code is available at https://github.com/songchuanle-1/RET .
- Research Article
- 10.1016/j.neunet.2025.108365
- Apr 1, 2026
- Neural Networks: The Official Journal of the International Neural Network Society
- Deguang Chen + 3 more
RCVQA: Visual question answering model based on reading comprehension.
- Research Article
- 10.1016/s0007-0912(26)00105-4
- Apr 1, 2026
- British Journal of Anaesthesia
Associate Editorial Board and cover image caption
- Research Article
- 10.23887/janapati.v15i1.108404
- Mar 31, 2026
- Jurnal Nasional Pendidikan Teknik Informatika (JANAPATI)
- I Putu Bagus Gede Prasetyo Raharja + 1 more
Pre-trained vision-language models such as BLIP have achieved remarkable success in general image captioning tasks. However, their performance on domain-specific applications, particularly cultural heritage documentation, remains limited due to the lack of specialized knowledge and the inability to handle multi-label cultural categories. Full fine-tuning of these large models is computationally expensive and risks catastrophic forgetting, while standard adapter-based methods treat all images uniformly without considering domain-specific class characteristics. This study proposes ML-CAA-BLIP (Multi-Label Cultural-Aware Adapter for BLIP), a novel parameter-efficient adaptation method for Balinese carving image captioning. The proposed method introduces class-specific scaling parameters for each cultural motif category (Barong, Punggel, Keketusan, Gajah, Goak, Cina, and Daun) and employs a learned importance-weighted fusion mechanism to handle multi-label inputs where images contain multiple artistic styles. Experiments conducted on the BaliCarving dataset comprising 2,181 images demonstrate that ML-CAA-BLIP achieves the best BLEU-4 score of 0.2718 (+52.4% improvement over Base BLIP) and ROUGE-L score of 0.5835 (+15.2% improvement) while adding only 903 trainable parameters. The model also shows competitive performance on other metrics including METEOR and BERTScore. These results indicate that cultural-aware adaptation significantly improves domain-specific image captioning while maintaining parameter efficiency, contributing to the digital preservation of Balinese cultural heritage.
- Research Article
- 10.29304/jqcsm.2026.18.12477
- Mar 30, 2026
- Journal of Al-Qadisiyah for Computer Science and Mathematics
- Haider Jaber Samawi + 1 more
The task of image captioning, which involves generating descriptive textual content from visual input, is a pivotal challenge in multimodal learning. This research delves into the advancements in image captioning facilitated by Transformer-based models, comparing their performance, architectures, and innovations across various tasks. Traditional models, such as CNNs paired with RNNs, were initially used to extract visual features and generate corresponding captions. However, the introduction of Transformer architectures has significantly enhanced the performance of image captioning systems, allowing for more coherent, context-aware, and grammatically correct captions. This paper explores the evolution of Transformer-based models, with a particular focus on the Encoder-Decoder, Vision-Language Fusion, and End-to-End Transformers models. By analyzing state-of-the-art architectures such as ViT, GPT, BLIP, and CoCa, the study demonstrates how these models address long-range dependencies, utilize self-attention mechanisms, and seamlessly integrate vision and language for improved caption generation. Furthermore, the paper evaluates the strengths, challenges, and limitations of these approaches, including issues related to computational complexity, dataset biases, and caption diversity. Ultimately, this study presents a comprehensive comparison of these models, offering insights into future research directions in the field of image captioning.
- Research Article
- 10.1371/journal.pone.0343823
- Mar 17, 2026
- PLOS One
- Jiquan Liu + 6 more
Surgical image captioning is critical for automated reporting and education but is currently limited by a lack of long-text datasets and the tendency of generic Multimodal Large Language Models (MLLMs) to hallucinate medical details. To address this, we present a comprehensive framework for long-text surgical captioning. First, we construct a verified long-text benchmark extending the EndoVis2018 dataset, utilizing an automated pipeline with expert-in-the-loop validation to transform brief triplets into rich narratives. Second, we investigate domain-specific adaptation strategies for MLLMs. We implement a surgical concept retrieval-augmented generation (RAG) mechanism that dynamically injects specialized knowledge (instruments, actions) into the visual encoder, effectively mitigating domain-specific hallucinations common in generic models. Finally, recognizing the inadequacy of n-gram metrics for long medical text, we establish a robust evaluation protocol using clinically-aligned metrics. Extensive experiments demonstrate that our data-centric and retrieval-enhanced approach significantly outperforms baselines in producing clinically accurate, coherent long descriptions.
- Research Article
- 10.1371/journal.pone.0345012
- Mar 16, 2026
- PLOS One
- Priyanka Panchal + 4 more
This paper presents a novel deep learning model for image captioning that combines an advanced vision transformer architecture with a powerful LLM. The proposed model shows a significant improvement over traditional CNN-RNN hybrids and existing transformer-based approaches by integrating a unique cross-attention mechanism that enables deep alignment between linguistic context and visual features. We demonstrate the strength of the proposed architecture through extensive evaluation on datasets including MS COCO, Flickr30K, and NoCaps. The proposed model consistently performs well against leading methods such as GIT, BLIP-2, and CoCa across a comprehensive suite of metrics. On the MS COCO dataset, its BLEU-4, METEOR, and CIDEr scores are 0.495, 0.390, and 1.32, respectively. We also critically analyze key challenges in this field, such as enhancing caption diversity, ensuring robust multimodal alignment, and mitigating inherent biases. By setting a new performance level, the proposed model offers a reference point for the next generation of image captioning systems. The results show the efficiency of our fusion strategy and support the development of models that can produce more precise, contextually rich, and human-like image descriptions. This work supports SDG 9 (Industry, Innovation, and Infrastructure) by advancing multimodal AI systems, and SDG 4 (Quality Education) by enabling intelligent and accessible image understanding technologies.
- Research Article
- 10.3390/s26061863
- Mar 16, 2026
- Sensors (Basel, Switzerland)
- Jingzhe Nie + 4 more
Object counting in remote sensing images is valuable for applications such as urban planning and environmental monitoring. However, it remains challenging due to heterogeneous annotations, semantic ambiguity in open-vocabulary queries, and performance degradation on small targets. To address these limitations, we propose DR-CLIP (Deformable Remote CLIP), a vision-language model for remote sensing image counting that combines deformable visual feature extraction with text-guided prediction. DR-CLIP includes (1) a Region-to-Instruction (R2I) mechanism that converts points, bounding boxes, and polygons into a unified image-text training representation, (2) a Multi-scale Deformable Attention (MSDA) module that enhances discriminative feature extraction across extreme scale variations and cluttered backgrounds, and (3) a Text-Guided Counting Head that establishes robust cross-modal alignment through contrastive learning, achieving open-vocabulary counting capability without category-specific retraining. On DOTA-v2.0, DR-CLIP achieves a Mean Absolute Error (MAE) of 2.34 and a Root Mean Squared Error (RMSE) of 3.89, outperforming baselines by 19.0% in MAE. The MSDA module significantly increases Small-Object Recall (SOR) to 0.824, which is especially effective when counting dense and small objects. In cross-modal retrieval, DR-CLIP attains R@1 scores of 68.3% (image-to-text) and 72.1% (text-to-image) on the Remote Sensing Image Captioning Dataset (RSICD). The framework generalizes robustly, with only 8.7% performance degradation in cross-domain tests, which is significantly lower than the 23.4% drop observed in baseline methods.
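For readers unfamiliar with the two counting metrics reported above, the following minimal sketch computes MAE and RMSE from per-image predicted and ground-truth counts. The function names and the sample counts used with them are illustrative and do not come from the DR-CLIP paper.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all images."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_squared_error(predicted, actual):
    """Square root of the average squared counting error."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))
```

RMSE is never smaller than MAE on the same data, and the gap between them grows when a few images have large counting errors, which is why the two are usually reported together.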
- Research Article
- 10.1177/00220345261424242
- Mar 15, 2026
- Journal of dental research
- M-X Li + 8 more
Vision-language models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes 1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; 2) a semistructured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and 3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging large language models (LLMs), we derive standardized benchmarks: approximately 15,000 visual question answering (VQA) pairs and an 18-class multilabel classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy (e.g., less than 70% in VQA) and producing inconsistent or incomplete descriptions in image captioning. These findings underscore the gap between general-purpose VLMs and the demands of specialized models, highlighting the need for domain-adapted training and more sophisticated evaluation protocols to assist professional dental practice and community oral health efforts. 
We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
- Research Article
- 10.55041/ijsrem57750
- Mar 15, 2026
- International Journal of Scientific Research in Engineering and Management
- Dr Md Sirajul Huque + 3 more
Abstract - Visual Question Answering (VQA) is a challenging multimodal task requiring joint understanding of visual content and natural language questions. Traditional VQA systems rely on complex attention-based architectures demanding significant computational resources and GPU training. This paper proposes an efficient and scalable VQA system using a pretrained CLIP (Contrastive Language–Image Pretraining) ViT-B/32 model for open-ended English-language queries. The proposed approach extracts semantically aligned image and question embeddings using a frozen CLIP backbone and combines them through a lightweight Multi-Layer Perceptron (MLP) classifier for answer prediction. Experiments on the VizWiz dataset, a real-world benchmark of images captured by visually impaired users, demonstrate competitive performance, achieving a Top-1 accuracy of 40.6% and Top-5 accuracy of 71.7%, trained entirely on CPU without end-to-end fine-tuning. A Flask-based web application supporting user authentication, image upload, and real-time Top-5 predictions with confidence scores is also demonstrated.
Key Words: Visual Question Answering, CLIP, VizWiz Dataset, MLP Classifier, Multimodal Learning, Deep Learning
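The pipeline described above (frozen embeddings, a small MLP head, Top-k evaluation) can be sketched roughly as follows. This is an assumption-laden illustration: the abstract says only that the embeddings are "combined", so concatenation, a single ReLU hidden layer, and all names here are hypothetical, and plain lists stand in for real CLIP embeddings.

```python
def mlp_scores(image_emb, question_emb, w1, b1, w2, b2):
    """Score each candidate answer from precomputed (frozen) embeddings."""
    x = image_emb + question_emb  # assumed fusion: simple concatenation
    # Hidden layer with ReLU activation (assumed).
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    # Output layer: one score per answer class.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def top_k_accuracy(score_rows, labels, k):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(score_rows, labels):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

Because the backbone is frozen, only the small weight matrices of the head would need training, which is consistent with the CPU-only training the abstract reports.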
- Research Article
- 10.1016/j.neucom.2025.132366
- Mar 1, 2026
- Neurocomputing
- Simin Xu + 4 more
SABA: Scene-aware bidirectional backdoor attack against multimodal learning
- Research Article
- 10.1002/cpe.70622
- Mar 1, 2026
- Concurrency and Computation: Practice and Experience
- Jiayu Bai + 3 more
ABSTRACT In image captioning tasks, many studies have shown that using both grid features and region features from images helps models better understand visual content, leading to more accurate descriptions. However, to save training time and keep feature extraction efficient, most research uses pre‐trained models to obtain these grid and region features. This approach provides the model with diverse features, but the pre‐trained models used for feature extraction were trained for different purposes, so their focus varies. As a result, many of the extracted visual features may not be well‐suited to the current task, introducing a significant amount of redundant information. To resolve the feature discrepancies and redundancy caused by the differing focuses of these models, we propose a model named Region Guide Grid Cross Transformer (RGGT) for image captioning. Because region features tend to lose more global visual‐semantic information than grid features, the model uses grid features as the primary representation during the encoding stage. We use a multi‐head cross‐attention mechanism that allows region features to guide the grid features, generating new grid features enriched with both global semantics and target‐region semantics. Furthermore, a feature refinement module based on sparse scan attention is introduced to purify the visual features and produce new region features derived from the refined new grid features. In the decoding stage, to better use the target‐region semantics from the new region features while preserving global features, we further integrate and control redundancy between the new grid features and new region features. To achieve this, we propose a deep feature fusion module based on a gate mechanism. This module combines text features with both region and grid features through their respective multi‐head cross‐attention mechanisms.
Using a gate mechanism, it automatically learns to control the proportion of each feature in the final fusion, enabling more accurate integration of the different feature information. We evaluate our RGGT model on the MSCOCO2014 dataset, with experimental results showing that it significantly outperforms both comparable approaches and state‐of‐the‐art methods. The code will be made available at https://github.com/Kickdog1022/RGGT_image_caption .
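The gate-based fusion described above can be illustrated with a generic element-wise sigmoid gate that mixes a region-derived vector and a grid-derived vector. This is a common formulation and only a sketch: the abstract does not give RGGT's exact gate equations, and the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(region_ctx, grid_ctx, gate_weights, gate_bias):
    """Mix two same-length feature vectors with a learned per-dimension gate.

    gate_weights[d] is a weight row over the concatenation of both inputs;
    in a trained model these values would be learned, here they are given.
    """
    concat = region_ctx + grid_ctx
    fused = []
    for d, (r, g) in enumerate(zip(region_ctx, grid_ctx)):
        z = sum(w * x for w, x in zip(gate_weights[d], concat)) + gate_bias[d]
        gate = sigmoid(z)  # share of the region feature in dimension d
        fused.append(gate * r + (1.0 - gate) * g)
    return fused
```

With all-zero gate parameters every gate equals 0.5 and the output is the plain average of the two inputs; training would move each gate toward whichever source is more useful for that dimension.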
- Research Article
- 10.56578/ataiml050105
- Mar 1, 2026
- Acadlore Transactions on AI and Machine Learning
- Abebe Kindie Awuraris + 2 more
This paper explored how generative artificial intelligence (AI) could enhance digital accessibility for individuals with visual, auditory, and cognitive impairments. It aims to develop an adaptive and context-sensitive system to dynamically customize content in accordance with users' needs. The proposed system performs text simplification with generative AI models like Generative Pretrained Transformer 3 (GPT-3), and captions images with Contrastive Language-Image Pre-Training (CLIP). It adapts to users' reactions with reinforcement learning, enabling the generation of real-time and personalized content. This project tested the system performance with mixed data, including texts, images, and videos. The outcomes revealed that the accessibility of the content had been significantly increased. The Flesch-Kincaid Grade Level was reduced by 50% through text simplification, and the bilingual evaluation understudy (BLEU) score reached 0.74 for image captioning. User satisfaction increased by 15% after feedback corrections. In addition to these results, the system demonstrated high effectiveness in supporting auditory-impaired users by achieving a subtitle synchronization accuracy of 94.6% in video content, and increasing auditory user satisfaction by 18% during accessibility evaluations. This study helped develop AI-based accessibility and provide a more inclusive online environment for people with disabilities, thus facilitating their access to online content. In conclusion, the proposed system is more convenient and could offer a broader range of individual and time-sensitive user experiences, compared to the current accessibility models.
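The Flesch-Kincaid Grade Level mentioned above is computed from word, sentence, and syllable counts using the standard formula 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below applies that formula and measures a relative reduction; the helper names are illustrative, and the abstract does not report the underlying counts.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level from raw text counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def relative_reduction(before, after):
    """Fractional drop from `before` to `after` (0.5 means a 50% reduction)."""
    return (before - after) / before
```

Shorter sentences and fewer syllables per word both lower the grade level, so a simplification system can reduce the score by splitting sentences, by choosing simpler vocabulary, or both.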
- Research Article
- 10.1016/s0007-0912(26)00068-1
- Mar 1, 2026
- British Journal of Anaesthesia
Associate Editorial Board and cover image caption
- Research Article
- 10.1016/j.knosys.2026.115272
- Mar 1, 2026
- Knowledge-Based Systems
- Anusha P + 1 more
Image captioning system for natural language processing using optimized attention-augmented residual convolutional neural network
- Research Article
- 10.1111/cgf.70398
- Feb 28, 2026
- Computer Graphics Forum
- Yuzhe Lu + 7 more
Abstract As a task at the intersection of computer vision and natural language processing, image captioning offers significant application value in domains such as intelligent human–computer interaction, accessibility support and multimedia content retrieval. The primary objective is to generate natural language descriptions by interpreting visual features, traditionally relying on heterogeneous single‐stream grid features and region features. However, existing approaches face limitations: grid features struggle to balance global semantic perception with local detail analysis, and region features exhibit weakened spatial modelling efficacy due to sparse semantic correlations. Furthermore, fusing heterogeneous visual features often lacks effective control over complementarity and redundancy, leading to descriptions prone to semantic bias or detail omission. To address these challenges, we propose a novel Multi‐Gated Dual‐Stream Visual Feature Fusion (MGDSF) for Image Captioning. Our approach enhances the semantic accuracy and completeness of generated captions through dual‐stream feature extraction and a multi‐gated fusion (MGF) mechanism. First, we employ a Mamba‐like linear attention mechanism to construct a grid feature network with hierarchical positional awareness. This network achieves global modelling while maintaining local sensitivity by dynamically modulating information flow. Second, based on the Detection Transformer (DETR) framework, we design a region feature extractor to provide complementary local object visual information. Finally, we introduce a MGF module that balances the complementarity of dual‐stream visual features and suppresses cross‐modal information redundancy via multiple context‐aware gates, thereby achieving fine‐grained visual‐semantic alignment. Experiments on MS COCO demonstrate that MGDSF surpasses existing methods on multiple evaluation metrics, achieving METEOR, ROUGE‐L and CIDEr scores of 30.0%, 59.8% and 140.1%, respectively. 
These results validate the effectiveness of our proposed method and indicate its broad application potential.