
Related Topics

  • Pre-trained Language Models
  • Semantic Parsing

Articles published on Image Captioning

1386 Search results
  • New
  • Research Article
  • 10.1016/j.eswa.2026.131248
ImCapDA: Fine-tuning CLIP via image captions for unsupervised domain adaptation
  • May 1, 2026
  • Expert Systems with Applications
  • Weiwei Xiang + 5 more

  • New
  • Research Article
  • 10.1142/s0219843626500076
A Hybrid Generative Adversarial Framework for Image Captioning with Reinforced Attention and Metaheuristic Optimization
  • Apr 22, 2026
  • International Journal of Humanoid Robotics
  • P Girija + 1 more

Image captioning aims to automatically produce relevant, descriptive text for a given image, integrating Natural Language Processing (NLP) and Computer Vision (CV) to understand visual content and express it in words. Existing image captioning methods struggle to generate accurate and contextually rich captions, which results in captions that lack descriptive quality and alignment with visual content. The objective of this study is to develop an efficient image captioning framework capable of producing accurate and semantically rich captions from images. In this research, a hybrid Attention-reinforced transformer with contrastive learning, Serval-Frigatebird Optimization, Gaussian Error Linear Unit-Long-Short Term Memory (ArCO-SerFO-GLSTM) based Generative Adversarial Image Captioning model is introduced for performing image captioning from a given dataset. The proposed model consists of the ArCO-SerFO generator, the Reinforcement Learning (RL) generator with a language evaluator, and a discriminator. First, in the ArCO-SerFO generator, the input image is passed through an image encoder to extract visual features, which are then fed to the caption decoder to generate a sample caption. The generated caption is compared with the ground-truth caption using a contrastive loss, which improves the alignment between image features and the caption. Here, the ArCO model is tuned using Serval-Frigatebird Optimization (SerFO). The system then uses an RL generator, where an image encoder and a multi-attention mechanism guide a language decoder to generate refined captions. These captions are evaluated by a language evaluator, and a Reinforcement Learning (RL) loss updates the model based on the reward metrics. Finally, both generated captions and ground-truth captions are fed into a GELU-LSTM discriminator, which distinguishes real captions from generated captions. The GELU-LSTM is developed by incorporating a GELU activation into an LSTM.
The developed ArCO-SerFO-GLSTM achieved a Recall-Oriented Understudy for Gisting Evaluation-L (ROUGE-L) score of 60.19%, Mean Average Precision (mAP) of 80.13%, Bilingual Evaluation Understudy (BLEU) of 84.23%, Metric for Evaluation of Translation with Explicit Ordering (METEOR) of 31.99%, Semantic Propositional Image Caption Evaluation (SPICE) of 25.99% and Consensus-based Image Description Evaluation (CIDEr) of 123.3 on the Flickr Image dataset.

  • New
  • Research Article
  • 10.1080/14702029.2026.2646453
The visual essay: thoughts on the form
  • Apr 21, 2026
  • Journal of Visual Art Practice
  • Jane Birkin

This article does not attempt to set down guidelines to visual essay writing; this would be impossible due to the complexities of the form. Instead, it provides observations on some of the ways that the visual essay can support different ways of thinking with text and image (altering perspectives, creating arguments and building narratives through the sequencing of images and the spatial relationships between image and text) as well as considering the role of image captions. At the same time, and not unconnected, it acknowledges the difficulties inherent in working with non-standard layouts within the parameters of academic publishing. By providing an exchange of ideas (both abstract and practical) that may seem marginal or even contradictory, it is hoped that looking at different aspects of the unique form will help to identify some of the paths that a visual essay might take and that writers might be encouraged to develop their essays in ambitious and experimental ways, considering the diversity of image behaviours and how they interact with text forms. In the longer term, the visual essay form itself and the dialogue around it might be advanced.

  • New
  • Research Article
  • 10.1080/13682199.2026.2658400
Deepseacap: a transformer approach for accurate and contextual captioning of underwater imagery
  • Apr 17, 2026
  • The Imaging Science Journal
  • Ganesh Khekare + 4 more

Low visibility, color distortion, and structural complexity are some of the harsh challenges that underwater imagery presents. This research presents a transformer-based image captioning framework designed specifically for underwater scenes. To produce precise and meaningful captions for underwater images, the proposed technique combines visual-linguistic fusion, contextual and semantic enhancements, and attention mechanisms. The method incorporates dedicated local language and spatial capabilities, enabling more precise interpretation and context identification in complex scenes. The model achieves an accuracy of 91.40%, substantially outperforming existing techniques. It is evaluated with standard measures such as Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR) and Consensus-based Image Description Evaluation (CIDEr). Comparative analysis and case studies show that the system can generate captions that are both aesthetically pleasing and linguistically rich, making it a valuable tool for monitoring, exploration, and marine documentation.

  • Research Article
  • 10.1145/3796710
Relationship-Experts Transformer for Image Captioning
  • Apr 11, 2026
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Chuanle Song + 4 more

Image captioning is a cross-modal text generation task aimed at understanding the relationships among various objects in an image. Accurately expressing object–object relations therefore remains a key bottleneck for transformer-based image captioning. Prior methods usually inject semantic and geometric relations once and keep them fixed while only updating visual features, creating a mismatch (evolving visuals vs. frozen relations) that weakens relational guidance and leads to feature entanglement. We propose the Relationship-Experts Transformer (RET), which treats semantic and geometric relations as learnable experts that guide object visual features (students) and co-evolve with them. In RET, we first design the Relationship-Guided Feature Aggregation (RGFA) module, which is analogous to expert-guided student learning: it uses the relationship kernel (the expert’s knowledge brain) to guide the learning of the object visual features (students). Second, we develop the Experts Knowledge Updating (EKU) module, which continuously iterates expert knowledge during training to enhance the expert’s guiding ability over the student. Finally, we design the Student Knowledge Selector (SKS) module to adaptively select object visual features enhanced with different relations under the guidance of semantic and geometric experts, generating descriptive texts that embody semantic and geometric knowledge. Experiments on the MSCOCO dataset demonstrate that our model achieves state-of-the-art performance. All code is available at https://github.com/songchuanle-1/RET .

  • Research Article
  • Cited by 1
  • 10.1016/j.neunet.2025.108365
RCVQA: Visual question answering model based on reading comprehension.
  • Apr 1, 2026
  • Neural networks : the official journal of the International Neural Network Society
  • Deguang Chen + 3 more

  • Research Article
  • 10.1016/s0007-0912(26)00105-4
Associate Editorial Board and cover image caption
  • Apr 1, 2026
  • British Journal of Anaesthesia

  • Research Article
  • 10.23887/janapati.v15i1.108404
ML-CAA-BLIP: Multi-Label Cultural-Aware Adapter for Balinese Carving Image Captioning
  • Mar 31, 2026
  • Jurnal Nasional Pendidikan Teknik Informatika (JANAPATI)
  • I Putu Bagus Gede Prasetyo Raharja + 1 more

Pre-trained vision-language models such as BLIP have achieved remarkable success in general image captioning tasks. However, their performance on domain-specific applications, particularly cultural heritage documentation, remains limited due to the lack of specialized knowledge and the inability to handle multi-label cultural categories. Full fine-tuning of these large models is computationally expensive and risks catastrophic forgetting, while standard adapter-based methods treat all images uniformly without considering domain-specific class characteristics. This study proposes ML-CAA-BLIP (Multi-Label Cultural-Aware Adapter for BLIP), a novel parameter-efficient adaptation method for Balinese carving image captioning. The proposed method introduces class-specific scaling parameters for each cultural motif category (Barong, Punggel, Keketusan, Gajah, Goak, Cina, and Daun) and employs a learned importance-weighted fusion mechanism to handle multi-label inputs where images contain multiple artistic styles. Experiments conducted on the BaliCarving dataset comprising 2,181 images demonstrate that ML-CAA-BLIP achieves the best BLEU-4 score of 0.2718 (+52.4% improvement over Base BLIP) and ROUGE-L score of 0.5835 (+15.2% improvement) while adding only 903 trainable parameters. The model also shows competitive performance on other metrics including METEOR and BERTScore. These results indicate that cultural-aware adaptation significantly improves domain-specific image captioning while maintaining parameter efficiency, contributing to the digital preservation of Balinese cultural heritage.

  • Research Article
  • 10.29304/jqcsm.2026.18.12477
From Pixels to Sentence: A Comprehensive Study of Transformers-Based Models for Image Captioning
  • Mar 30, 2026
  • Journal of Al-Qadisiyah for Computer Science and Mathematics
  • Haider Jaber Samawi + 1 more

The task of image captioning, which involves generating descriptive textual content from visual input, is a pivotal challenge in multimodal learning. This research delves into the advancements in image captioning facilitated by Transformer-based models, comparing their performance, architectures, and innovations across various tasks. Traditional models, such as CNNs paired with RNNs, were initially used to extract visual features and generate corresponding captions. However, the introduction of Transformer architectures has significantly enhanced the performance of image captioning systems, allowing for more coherent, context-aware, and grammatically correct captions. This paper explores the evolution of Transformer-based models, with a particular focus on the Encoder-Decoder, Vision-Language Fusion, and End-to-End Transformers models. By analyzing state-of-the-art architectures such as ViT, GPT, BLIP, and CoCa, the study demonstrates how these models address long-range dependencies, utilize self-attention mechanisms, and seamlessly integrate vision and language for improved caption generation. Furthermore, the paper evaluates the strengths, challenges, and limitations of these approaches, including issues related to computational complexity, dataset biases, and caption diversity. Ultimately, this study presents a comprehensive comparison of these models, offering insights into future research directions in the field of image captioning.

  • Research Article
  • 10.1371/journal.pone.0343823
Long-text caption generation for surgical image with a concept retrieval augmented large multimodal model
  • Mar 17, 2026
  • PLOS One
  • Jiquan Liu + 6 more

Surgical image captioning is critical for automated reporting and education but is currently limited by a lack of long-text datasets and the tendency of generic Multimodal Large Language Models (MLLMs) to hallucinate medical details. To address this, we present a comprehensive framework for long-text surgical captioning. First, we construct a verified long-text benchmark extending the EndoVis2018 dataset, utilizing an automated pipeline with expert-in-the-loop validation to transform brief triplets into rich narratives. Second, we investigate domain-specific adaptation strategies for MLLMs. We implement a surgical concept retrieval-augmented generation (RAG) mechanism that dynamically injects specialized knowledge (instruments, actions) into the visual encoder, effectively mitigating domain-specific hallucinations common in generic models. Finally, recognizing the inadequacy of n-gram metrics for long medical text, we establish a robust evaluation protocol using clinically-aligned metrics. Extensive experiments demonstrate that our data-centric and retrieval-enhanced approach significantly outperforms baselines in producing clinically accurate, coherent long descriptions.

  • Research Article
  • 10.1371/journal.pone.0345012
Deep learning–driven image captioning: Progress through transformers and large language models
  • Mar 16, 2026
  • PLOS One
  • Priyanka Panchal + 4 more

This paper presents a novel deep learning model for image captioning that combines an advanced vision transformer architecture with a powerful LLM. The proposed model shows a significant improvement over traditional CNN-RNN hybrids and existing transformer-based approaches by integrating a unique cross-attention mechanism that enables deep alignment between linguistic context and visual features. We demonstrate the strength of the proposed architecture through extensive evaluation on the MS COCO, Flickr30K, and NoCaps datasets. The proposed model consistently performs well against leading methods such as GIT, BLIP-2, and CoCa across a comprehensive suite of metrics. On the MS COCO dataset, it achieves BLEU-4, METEOR, and CIDEr scores of 0.495, 0.390, and 1.32, respectively. We also critically analyze key challenges in the field, such as enhancing caption diversity, ensuring robust multimodal alignment, and mitigating inherent biases. By setting a new performance level, the proposed model provides a reference point for the next generation of image captioning systems. The results show the efficiency of our fusion strategy and facilitate the development of models that can produce more precise, contextually rich, and human-like image descriptions. This work supports SDG 9 (Industry, Innovation, and Infrastructure) by advancing multimodal AI systems, and SDG 4 (Quality Education) by enabling intelligent and accessible image understanding technologies.

  • Research Article
  • 10.3390/s26061863
DR-CLIP: A Deformable Vision-Language Model for Scale-Invariant Object Counting in Remote Sensing Images.
  • Mar 16, 2026
  • Sensors (Basel, Switzerland)
  • Jingzhe Nie + 4 more

Object counting in remote sensing images is valuable for applications such as urban planning and environmental monitoring. However, it remains challenging due to heterogeneous annotations, semantic ambiguity in open-vocabulary queries, and performance degradation on small targets. To address these limitations, we propose DR-CLIP (Deformable Remote CLIP), a vision-language model for remote sensing image counting that combines deformable visual feature extraction with text-guided prediction. DR-CLIP includes (1) a Region-to-Instruction (R2I) mechanism that converts points, bounding boxes, and polygons into a unified image-text training representation, (2) a Multi-scale Deformable Attention (MSDA) module that enhances discriminative feature extraction across extreme scale variations and cluttered backgrounds, and (3) a Text-Guided Counting Head that establishes robust cross-modal alignment through contrastive learning, achieving open-vocabulary counting capability without category-specific retraining. On DOTA-v2.0, DR-CLIP achieves a Mean Absolute Error (MAE) of 2.34 and a Root Mean Squared Error (RMSE) of 3.89, outperforming baselines by 19.0% in MAE. The MSDA module significantly increases Small-Object Recall (SOR) to 0.824, which is especially effective in situations involving dense and small object counting. In cross-modal retrieval, DR-CLIP attains R@1 scores of 68.3% (image-to-text) and 72.1% (text-to-image) on the Remote Sensing Image Captioning Dataset (RSICD). The framework generalizes robustly, with only 8.7% performance degradation in cross-domain tests, which is significantly lower than the 23.4% drop observed in baseline methods.
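
The abstract above reports counting quality as MAE and RMSE. As a minimal sketch of how these two metrics are conventionally computed over per-image counts (the function names and sample counts below are invented for illustration and are not taken from the paper):

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error between predicted and true object counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error; penalizes large miscounts more than MAE."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Hypothetical per-image counts, made up for this example only.
predicted_counts = [12, 7, 30, 4]
true_counts = [10, 7, 33, 5]

print(mae(predicted_counts, true_counts))   # 1.5
print(rmse(predicted_counts, true_counts))  # square root of 3.5
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which is consistent with the 2.34 and 3.89 figures reported above.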

  • Research Article
  • 10.1177/00220345261424242
MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry.
  • Mar 15, 2026
  • Journal of dental research
  • M-X Li + 8 more

Vision-language models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes 1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; 2) a semistructured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and 3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging large language models (LLMs), we derive standardized benchmarks: approximately 15,000 visual question answering (VQA) pairs and an 18-class multilabel classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy (e.g., less than 70% in VQA) and producing inconsistent or incomplete descriptions in image captioning. These findings underscore the gap between general-purpose VLMs and the demands of specialized models, highlighting the need for domain-adapted training and more sophisticated evaluation protocols to assist professional dental practice and community oral health efforts. 
We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

  • Research Article
  • 10.55041/ijsrem57750
English Visual Question Answering: Building a Culturally Relevant Dataset from Image Captions
  • Mar 15, 2026
  • International Journal of Scientific Research in Engineering and Management
  • Dr Md Sirajul Huque + 3 more

Visual Question Answering (VQA) is a challenging multimodal task requiring joint understanding of visual content and natural language questions. Traditional VQA systems rely on complex attention-based architectures demanding significant computational resources and GPU training. This paper proposes an efficient and scalable VQA system using a pretrained CLIP (Contrastive Language–Image Pretraining) ViT-B/32 model for open-ended English language queries. The proposed approach extracts semantically aligned image and question embeddings using a frozen CLIP backbone and combines them through a lightweight Multi-Layer Perceptron (MLP) classifier for answer prediction. Experiments on the VizWiz dataset, a real-world benchmark of images captured by visually impaired users, demonstrate competitive performance, achieving a Top-1 accuracy of 40.6% and Top-5 accuracy of 71.7%, trained entirely on CPU without end-to-end fine-tuning. A Flask-based web application supporting user authentication, image upload, and real-time Top-5 predictions with confidence scores is also demonstrated.
Key Words: Visual Question Answering, CLIP, VizWiz Dataset, MLP Classifier, Multimodal Learning, Deep Learning
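
The pipeline described above (frozen embeddings, a small MLP, Top-1/Top-5 scoring) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are random stand-ins rather than real CLIP outputs, fusing by concatenation is an assumption (the abstract only says the embeddings are combined through an MLP), and all names and sizes are hypothetical.

```python
import random

def mlp_scores(features, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear layer that scores each answer."""
    hidden = [max(0.0, sum(f * w for f, w in zip(features, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

def top_k(scores, k):
    """Indices of the k highest-scoring answers, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def top_k_accuracy(all_scores, labels, k):
    """Fraction of examples whose true answer is among the top-k predictions."""
    hits = sum(1 for s, y in zip(all_scores, labels) if y in top_k(s, k))
    return hits / len(labels)

# Stand-ins for frozen CLIP outputs. Real ViT-B/32 embeddings are 512-d;
# 4-d random vectors are used here only to keep the example small.
rng = random.Random(0)
dim, hidden_dim, num_answers = 4, 8, 6
image_emb = [rng.uniform(-1, 1) for _ in range(dim)]
question_emb = [rng.uniform(-1, 1) for _ in range(dim)]
features = image_emb + question_emb  # fusion by concatenation (assumption)

# Untrained, randomly initialized MLP weights (hypothetical).
w1 = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(num_answers)]
b2 = [0.0] * num_answers

scores = mlp_scores(features, w1, b1, w2, b2)
print(top_k(scores, 5))  # the five answer indices a Top-5 interface would show
```

Since only the MLP weights would be trained while the backbone stays frozen, the embeddings can be computed once and cached, which is one plausible reason CPU-only training is feasible.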

  • Research Article
  • 10.1016/j.neucom.2025.132366
SABA: Scene-aware bidirectional backdoor attack against multimodal learning
  • Mar 1, 2026
  • Neurocomputing
  • Simin Xu + 4 more

  • Research Article
  • 10.1002/cpe.70622
Region Guide Grid Cross Transformer for Image Caption
  • Mar 1, 2026
  • Concurrency and Computation: Practice and Experience
  • Jiayu Bai + 3 more

In image captioning tasks, many studies have shown that using both grid features and region features from images helps models better understand visual content, leading to more accurate descriptions. However, to save training time and keep feature extraction efficient, most research uses pre‐trained models to obtain these grid and region features. This approach provides the model with diverse features, but the pre‐trained models used for feature extraction were trained for different purposes, leading to variations in their focus. As a result, many of the extracted visual features may not be well‐suited for the current task, introducing a significant amount of redundant information. To resolve the feature discrepancies and redundancy caused by the differing focuses of these models, we propose a model named Region Guide Grid Cross Transformer (RGGT) for image captioning. In our model, since region features tend to lose more global visual‐semantic information than grid features, the model uses grid features as the primary representation during the encoding stage. We use a multi‐head cross‐attention mechanism that allows region features to guide the grid features, generating new grid features enriched with both global semantics and target‐region semantics. Furthermore, a feature refinement module based on sparse scan attention is introduced to purify the visual features and produce new region features derived from the refined new grid features. In the decoding stage, to better use the target‐region semantics from the new region features while preserving global features, we further integrate the new grid features and new region features and control the redundancy between them. To achieve this, we propose a feature deep fusion module based on a gate mechanism. This module combines text features with both region and grid features through their respective multi‐head cross‐attention mechanisms.
Using a gate mechanism, it automatically learns to control the proportion of each feature in the final fusion, enabling more accurate integration of the different feature information. We evaluate our RGGT model on the MSCOCO2014 dataset, with experimental results demonstrating its outstanding performance. The model significantly outperforms both comparable approaches and state‐of‐the‐art methods. The code will be made available at https://github.com/Kickdog1022/RGGT_image_caption .
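
To make the idea of a gate that "controls the proportion of each feature" concrete, here is a minimal sketch of one common form of gated fusion: a sigmoid gate computed from both inputs, then a convex blend. The abstract does not give the exact formulation used in RGGT, so the equations, function names, and toy vectors below are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(grid_ctx, region_ctx, gate_weights, gate_bias):
    """Blend two context vectors with a learned per-dimension gate.

    gate  = sigmoid(W [grid_ctx; region_ctx] + b)
    fused = gate * grid_ctx + (1 - gate) * region_ctx
    """
    concat = grid_ctx + region_ctx  # list concatenation, i.e. [grid; region]
    fused = []
    for i, (g, r) in enumerate(zip(grid_ctx, region_ctx)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        gate = sigmoid(logit)
        fused.append(gate * g + (1.0 - gate) * r)
    return fused

# Hypothetical 2-d outputs of the grid and region cross-attention branches.
grid_ctx = [1.0, 0.0]
region_ctx = [0.0, 1.0]

# Zero weights and bias give gate = 0.5 everywhere, i.e. a plain average.
zero_w = [[0.0] * 4 for _ in range(2)]
print(gated_fusion(grid_ctx, region_ctx, zero_w, [0.0, 0.0]))  # [0.5, 0.5]
```

In a trained model the gate weights would be learned, so the blend can lean toward grid features for some dimensions and toward region features for others rather than averaging them uniformly.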

  • Research Article
  • 10.56578/ataiml050105
Empowering Accessibility to Digital Space Through Generative AI to Support People with Disabilities
  • Mar 1, 2026
  • Acadlore Transactions on AI and Machine Learning
  • Abebe Kindie Awuraris + 2 more

This paper explored how generative artificial intelligence (AI) could enhance digital accessibility for individuals with visual, auditory, and cognitive impairments. It aims to develop an adaptive and context-sensitive system that dynamically customizes content in accordance with users' needs. The proposed system performs text simplification with generative AI models like Generative Pretrained Transformer 3 (GPT-3), and captions images with Contrastive Language-Image Pre-Training (CLIP). It adapts to users' reactions through reinforcement learning, enabling the generation of real-time, personalized content. The project tested system performance with mixed data, including texts, images, and videos. The outcomes revealed that the accessibility of the content had increased significantly. The Flesch-Kincaid Grade Level was reduced by 50% through text simplification, and the bilingual evaluation understudy (BLEU) score reached 0.74 for image captioning. User satisfaction increased by 15% after feedback corrections. In addition, the system was highly effective in supporting auditory-impaired users, achieving a subtitle synchronization accuracy of 94.6% in video content and increasing auditory user satisfaction by 18% during accessibility evaluations. This study helps advance AI-based accessibility and provides a more inclusive online environment for people with disabilities, thus facilitating their access to online content. In conclusion, the proposed system is more convenient and could offer a broader range of individual and time-sensitive user experiences than current accessibility models.
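
For readers unfamiliar with the readability metric cited above, the standard Flesch-Kincaid Grade Level formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies it to two sets of hypothetical counts (invented for illustration, not taken from the paper) to show how shorter sentences and simpler words lower the grade.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts of a text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts for a 100-word passage before and after simplification.
original = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
simplified = flesch_kincaid_grade(words=100, sentences=10, syllables=130)

print(round(original, 2))    # 9.91
print(round(simplified, 2))  # 3.65
```

Counting syllables automatically requires a heuristic or a pronunciation dictionary, so this sketch takes the counts as inputs instead of deriving them from text.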

  • Research Article
  • 10.1016/s0007-0912(26)00068-1
Associate Editorial Board and cover image caption
  • Mar 1, 2026
  • British Journal of Anaesthesia

  • Research Article
  • 10.1016/j.knosys.2026.115272
Image captioning system for natural language processing using optimized attention-augmented residual convolutional neural network
  • Mar 1, 2026
  • Knowledge-Based Systems
  • Anusha P + 1 more

  • Research Article
  • 10.1111/cgf.70398
Multi‐Gated Dual‐Stream Visual Feature Fusion for Image Captioning
  • Feb 28, 2026
  • Computer Graphics Forum
  • Yuzhe Lu + 7 more

As a task at the intersection of computer vision and natural language processing, image captioning offers significant application value in domains such as intelligent human–computer interaction, accessibility support and multimedia content retrieval. The primary objective is to generate natural language descriptions by interpreting visual features, traditionally relying on heterogeneous single‐stream grid features and region features. However, existing approaches face limitations: grid features struggle to balance global semantic perception with local detail analysis, and region features exhibit weakened spatial modelling efficacy due to sparse semantic correlations. Furthermore, fusing heterogeneous visual features often lacks effective control over complementarity and redundancy, leading to descriptions prone to semantic bias or detail omission. To address these challenges, we propose a novel Multi‐Gated Dual‐Stream Visual Feature Fusion (MGDSF) for Image Captioning. Our approach enhances the semantic accuracy and completeness of generated captions through dual‐stream feature extraction and a multi‐gated fusion (MGF) mechanism. First, we employ a Mamba‐like linear attention mechanism to construct a grid feature network with hierarchical positional awareness. This network achieves global modelling while maintaining local sensitivity by dynamically modulating information flow. Second, based on the Detection Transformer (DETR) framework, we design a region feature extractor to provide complementary local object visual information. Finally, we introduce an MGF module that balances the complementarity of dual‐stream visual features and suppresses cross‐modal information redundancy via multiple context‐aware gates, thereby achieving fine‐grained visual‐semantic alignment. Experiments on MS COCO demonstrate that MGDSF surpasses existing methods on multiple evaluation metrics, achieving METEOR, ROUGE‐L and CIDEr scores of 30.0%, 59.8% and 140.1%, respectively.
These results validate the effectiveness of our proposed method and indicate its broad application potential.
