Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Vision-language pre-training has been an emerging and fast-developing research topic that transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representation of an image can span different granularities, from simple to comprehensive: an individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.
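Below is a minimal PyTorch sketch of how the four multi-granular proxy tasks could be combined into a single pre-training objective. The head names, tensor shapes, and equal loss weighting are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class UniEdenPretrainLoss(nn.Module):
    """Hypothetical combination of Uni-EDEN's four proxy-task losses."""
    def __init__(self, hidden=768, num_obj_classes=1600, vocab=30522):
        super().__init__()
        self.obj_cls_head = nn.Linear(hidden, num_obj_classes)  # Masked Object Classification
        self.phrase_head = nn.Linear(hidden, vocab)             # Masked Region Phrase Generation
        self.match_head = nn.Linear(hidden, 2)                  # Image-Sentence Matching
        self.lm_head = nn.Linear(hidden, vocab)                 # Masked Sentence Generation
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)        # -100 marks unmasked positions

    def forward(self, obj_feats, dec_feats, pooled, obj_labels,
                phrase_labels, match_labels, sent_labels):
        # Each task supervises a different granularity of the same image:
        # labels (objects), phrases (regions), and full sentences.
        l_moc = self.ce(self.obj_cls_head(obj_feats).flatten(0, 1), obj_labels.flatten())
        l_mrpg = self.ce(self.phrase_head(dec_feats).flatten(0, 1), phrase_labels.flatten())
        l_ism = self.ce(self.match_head(pooled), match_labels)
        l_msg = self.ce(self.lm_head(dec_feats).flatten(0, 1), sent_labels.flatten())
        return l_moc + l_mrpg + l_ism + l_msg
```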
- Research Article
3
- 10.1109/tpami.2025.3582000
- Jan 1, 2025
- IEEE transactions on pattern analysis and machine intelligence
Recent advancements in language models have demonstrated their capacity for context understanding and generative representation. Leveraging these developments, we propose a novel multimodal trajectory predictor based on a vision-language model, named VLMTraj, which takes full advantage of the prior knowledge of multimodal large language models and their human-like reasoning across diverse modalities. The key idea of our model is to reframe the trajectory prediction task as visual question answering, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, we transform all inputs into a natural-language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. Additionally, visual features from input images are transformed into tokens via a modality encoder and connector. The transformed data is then formatted for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer on the prompt data to effectively separate integer and decimal parts, allowing us to capture correlations between consecutive numbers in the language model. We then train our language model on all the visual question answering prompts. During inference, we implement both deterministic and stochastic prediction through beam-search-based most-likely prediction and temperature-based multimodal generation. VLMTraj validates that a language-based model can be a powerful pedestrian trajectory predictor, and it outperforms existing numerical-based predictors. Extensive experiments show that VLMTraj successfully understands social relationships and accurately extrapolates multimodal futures on public pedestrian trajectory prediction benchmarks.
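Below is a minimal sketch of the prompt-conversion idea: a numeric trajectory history plus an automatic scene caption becomes a conversational question for the language model. The prompt wording and the explicit integer/decimal split (mirroring the goal of the paper's learned numerical tokenizer) are assumptions for illustration.

```python
def trajectory_to_prompt(history, caption):
    """history: list of (x, y) floats; caption: auto-generated scene description."""
    def fmt(v):
        # Split integer and decimal parts explicitly, in the spirit of the
        # paper's numerical tokenizer (the real tokenizer is learned).
        whole, frac = f"{v:.2f}".split(".")
        return f"{whole} . {frac}"
    steps = "; ".join(f"({fmt(x)}, {fmt(y)})" for x, y in history)
    return (f"Scene: {caption}\n"
            f"Past pedestrian positions: {steps}.\n"
            f"Question: where will the pedestrian move over the next 12 steps?")

print(trajectory_to_prompt([(1.0, 2.53), (1.21, 2.87)], "a crowded crosswalk"))
```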
- Research Article
5
- 10.2196/56627
- Aug 5, 2024
- JMIR Medical Informatics
Background: Medical image analysis, particularly in the context of visual question answering (VQA) and image captioning, is crucial for accurate diagnosis and educational purposes.
Objective: Our study aims to introduce BioMedBLIP models, fine-tuned for VQA tasks using specialized medical data sets such as Radiology Objects in Context and Medical Information Mart for Intensive Care-Chest X-ray, and to evaluate their performance in comparison to the state-of-the-art (SOTA) original Bootstrapping Language-Image Pretraining (BLIP) model.
Methods: We present 9 versions of BioMedBLIP across 3 downstream tasks on various data sets, trained for varying numbers of epochs. We propose BioMedBLIP VQA generation, VQA classification, and image captioning models. We conducted pretraining of BLIP on medical data sets, producing an adapted BLIP model tailored for medical applications.
Results: Our models show strong overall performance. In VQA generation tasks, BioMedBLIP models outperformed the SOTA on the Semantically-Labeled Knowledge-Enhanced (SLAKE), VQA in Radiology (VQA-RAD), and Image Cross-Language Evaluation Forum data sets. In VQA classification, our models consistently surpassed the SOTA on the SLAKE data set and showed competitive performance on the VQA-RAD and PathVQA data sets. Similarly, in image captioning tasks, our model beat the SOTA, suggesting the importance of pretraining with medical data sets. Overall, across 20 data set and task combinations, BioMedBLIP represents a new SOTA in 15 (75%) of 20 tasks, and our responses were rated higher in all 20 tasks (P<.005) in comparison to SOTA models.
Conclusions: Our BioMedBLIP models show promising performance and suggest that incorporating medical knowledge through pretraining with domain-specific medical data sets helps models achieve higher performance. Our models thus demonstrate their potential to advance medical image analysis, impacting diagnosis, medical education, and research. However, data quality, task-specific variability, computational resources, and ethical considerations should be carefully addressed. Our models represent a contribution toward the synergy of artificial intelligence and medicine. We have made BioMedBLIP freely available, which will help further advance research in multimodal medical tasks.
- Research Article
61
- 10.1016/j.eswa.2022.118669
- Aug 28, 2022
- Expert Systems with Applications
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to encode world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task that requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) state-of-the-art results when increasing the size of the language model. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis of OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities for further improving inference in visio-linguistic tasks.
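The pipeline is simple enough to sketch end to end: caption the image with an off-the-shelf model, then let a text-only language model answer from the caption. The checkpoints below are placeholders rather than the authors' exact models, assuming the Hugging Face transformers library.

```python
from transformers import pipeline

# Placeholder checkpoints; the paper's captioner and language models may differ.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
reader = pipeline("text-generation", model="gpt2")

def text_only_vqa(image_path, question):
    caption = captioner(image_path)[0]["generated_text"]  # verbalize the image
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    out = reader(prompt, max_new_tokens=10)[0]["generated_text"]
    return out[len(prompt):].strip()  # keep only the newly generated answer
```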
- Research Article
- 10.54254/2755-2721/2026.as31461
- Jan 26, 2026
- Applied and Computational Engineering
Multimodal understanding, which requires models to jointly reason over visual and linguistic information, has become a core challenge in artificial intelligence (AI). Visual Question Answering (VQA) stands as a paradigmatic task for investigating these multimodal reasoning capabilities. While early VQA systems relied on task-specific architectures, recent breakthroughs in Multimodal Large Language Models (MLLMs) have significantly reshaped the field by proposing unified, instruction-driven multimodal reasoning frameworks. By conducting a systematic literature review, this paper scrutinizes the evolution of VQA from traditional CNN-LSTM-based models to modern MLLM-based approaches. The review centers on representative architectures and training paradigms, including BLIP-2 and LLaVA, to analyze how large language models and pretrained vision encoders are integrated for flexible and open-ended visual reasoning. In addition, this paper identifies and deliberates on critical challenges confronting contemporary MLLMs, encompassing modality imbalance, insufficient cross-modal alignment, and hallucinations. This paper concludes that while MLLMs have substantially expanded the application scope and functional capabilities of VQA systems, they still grapple with reliable visual grounding and balanced multimodal fusion. Addressing these limitations is paramount for constructing trustworthy and robust VQA systems, and future research should prioritize improving alignment mechanisms and mitigating hallucinations in multimodal reasoning.
- Preprint Article
- 10.21203/rs.3.rs-5533456/v1
- Nov 29, 2024
Recent advancements in vision-language models have achieved remarkable results in making language models understand visual inputs. However, a unified approach to aligning these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large language models or very large datasets, which makes inefficient use of existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models to unify vision-language tasks such as image captioning and visual question answering. We propose four training stages for aligning the vision model with the language model; in other words, the language model is given the ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we report several findings: 1) the attention mask should not be applied to visual inputs; 2) the language model converges faster on AI-generated data; 3) more work should be done in the alignment stage during pre-training; 4) the model can easily adapt to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After training for one epoch across all stages, the model outperforms large models such as the 13-billion-parameter VILA on common benchmarks such as CIDEr scores on the COCO and Flickr30k datasets, and it achieves scores very close to GIT-2 on the same datasets despite being a much smaller model trained on a much smaller dataset. All training follows available best practices, including multi-GPU parallel training, lower-precision training with 16-bit floats, faster attention (SDPA), and gradient accumulation, and it completes within 12 hours.
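A minimal sketch of the hybrid attention mask implied by finding 1: visual tokens attend bidirectionally (no causal mask), text tokens remain causal, and text can always attend to the image. The additive-mask convention and visual-first token ordering are assumptions.

```python
import torch

def hybrid_mask(n_vis, n_txt):
    n = n_vis + n_txt
    mask = torch.full((n, n), float("-inf"))
    mask[:n_vis, :n_vis] = 0.0  # visual tokens: full bidirectional attention
    causal = torch.triu(torch.ones(n_txt, n_txt, dtype=torch.bool), diagonal=1)
    mask[n_vis:, n_vis:] = torch.zeros(n_txt, n_txt).masked_fill(causal, float("-inf"))
    mask[n_vis:, :n_vis] = 0.0  # text tokens can always see the image
    return mask  # added to attention logits before the softmax

print(hybrid_mask(2, 3))
```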
- Book Chapter
- 10.3233/faia250227
- Mar 17, 2025
- Frontiers in artificial intelligence and applications
This chapter explores the advancements and challenges in achieving comprehensive scene understanding and visual reasoning through neurosymbolic integration and Multimodal Large Language Models (MLLMs). It begins by highlighting the limitations of basic vision tasks in extracting contextual and relational information from scenes, introducing scene graphs as a structured representation to bridge this gap. The chapter delves into Scene Graph Generation (SGG) methods, emphasising the importance of incorporating common sense knowledge from knowledge graphs to enhance the accuracy and expressiveness of scene graphs. The NeuSyRE framework is presented as a neurosymbolic approach for enriched scene graph generation and reasoning, demonstrating its effectiveness in downstream tasks such as image captioning and visual question answering. The chapter also examines the role of MLLMs in visual reasoning, discussing their architectures, performance on zero-shot tasks and challenges in handling fine-grained visual details. MARVEL, a novel benchmark for abstract visual reasoning, is introduced to evaluate the perceptual and reasoning capabilities of MLLMs. Insights from MARVEL highlight the limitations of current MLLMs in solving complex reasoning tasks and underscore the potential of neurosymbolic systems to address these challenges. The chapter concludes by emphasising the synergy between neurosymbolic approaches and MLLMs in advancing visual intelligence and achieving robust and explainable AI systems.
- Research Article
10
- 10.1609/aaai.v37i11.26569
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Visual Question Answering (VQA) aims to answer a natural language question about a given image by understanding multimodal content. However, the answer quality of most existing visual-language pre-training (VLP) methods is still limited, mainly due to: (1) Incompatibility. Upstream pre-training tasks are generally incompatible with downstream question answering tasks, so knowledge from the language model does not transfer well to downstream tasks, greatly limiting performance in few-shot scenarios. (2) Under-fitting. They generally do not integrate human priors to complement the universal knowledge from language models, so as to fit the challenging VQA problem and generate reliable answers. To address these issues, we propose HybridPrompt, a cloze- and verify-style hybrid prompt framework that bridges language models and human priors in prompt tuning for VQA. Specifically, we first rewrite the input questions as cloze-style prompts to narrow the gap between upstream pre-training tasks and the downstream VQA task, ensuring that the universal knowledge in the language model transfers well to the subsequent human-prior-guided prompt tuning. Then, imitating the cognitive process of the human brain, we introduce topic- and sample-related priors to construct a dynamic learnable prompt template for human-prior-guided prompt learning. Finally, we add fixed-length learnable free parameters to further enhance the generalizability and scalability of prompt learning in the VQA model. Experimental results verify the effectiveness of HybridPrompt: it achieves competitive performance against previous methods on the widely used VQAv2 dataset and obtains new state-of-the-art results. Our code is released at: https://github.com/zhizhi111/hybrid.
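A minimal sketch of the cloze-style rewriting step, assuming a small set of pattern rules; the paper's actual templates and its learnable prompt components are not reproduced here.

```python
def to_cloze(question):
    """Rewrite a VQA question into a masked-LM-style cloze prompt."""
    q = question.rstrip("?").strip()
    rules = [
        ("what color is", "the color of{rest} is [MASK]"),
        ("what is", "{rest} is [MASK]"),
    ]
    for prefix, template in rules:
        if q.lower().startswith(prefix):
            return template.format(rest=" " + q[len(prefix):].strip())
    return f"{q}? [MASK]"  # fallback: append a mask slot

print(to_cloze("What color is the umbrella?"))
# -> "the color of the umbrella is [MASK]"
```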
- Research Article
5
- 10.1609/aaai.v39i9.33073
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
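A minimal sketch of a lopsided language-model loss in the spirit of L-LM: per-token cross-entropy is split between the short judgment span and the longer interpretation span and weighted separately so the judgment dominates. The weights and token bookkeeping are assumptions.

```python
import torch
import torch.nn.functional as F

def lopsided_lm_loss(logits, labels, is_judgment, w_judge=1.0, w_interp=0.1):
    """logits: (T, V); labels: (T,); is_judgment: (T,) bool mask over tokens."""
    per_token = F.cross_entropy(logits, labels, reduction="none")
    loss_judge = per_token[is_judgment].mean()    # e.g., the "real"/"spoof" tokens
    loss_interp = per_token[~is_judgment].mean()  # the free-text interpretation
    return w_judge * loss_judge + w_interp * loss_interp
```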
- Research Article
67
- 10.1016/j.inffus.2021.02.006
- Feb 12, 2021
- Information Fusion
DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and Explanation Generation
- Research Article
- Mar 17, 2025
- ArXiv
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing that MicroVQA is a valuable resource for advancing AI-driven biomedical research. MicroVQA is available here, project here.
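The two-stage construction can be sketched as a loop: structure the raw QA pair into an MCQ, then keep rewriting it while a text-only model can still answer correctly without the image. The three callables below are hypothetical stand-ins for the paper's prompts and RefineBot agent, passed in as parameters so the sketch stays self-contained.

```python
def build_mcq(qa_pair, image, llm_make_mcq, text_only_answer, llm_refine, max_rounds=3):
    """llm_make_mcq, text_only_answer, llm_refine are hypothetical callables."""
    mcq = llm_make_mcq(qa_pair)                      # stage 1: structure QA into an MCQ
    for _ in range(max_rounds):                      # stage 2: RefineBot-style loop
        if text_only_answer(mcq) != mcq["answer"]:   # blind model fails: no shortcut left
            break
        mcq = llm_refine(mcq, image)                 # rewrite stem/distractors and retry
    return mcq
```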
- Research Article
26
- 10.1145/3711680
- Mar 5, 2025
- ACM Computing Surveys
Visual Question Answering (VQA) is a challenging task that combines natural language processing and computer vision techniques and has gradually become a benchmark task for multimodal large language models (MLLMs). The goal of our survey is to provide an overview of the development of VQA and an up-to-date, detailed description of the latest models. This survey gives a current synthesis of natural language understanding of images and text, as well as of the knowledge reasoning module based on image-question information, for the core VQA tasks. In addition, we elaborate on recent advances in extracting and fusing modal information with vision-language pre-training models and multimodal large language models in VQA. We also exhaustively review the progress of knowledge reasoning in VQA, detailing the extraction of internal knowledge and the introduction of external knowledge. Finally, we present VQA datasets and different evaluation metrics and discuss possible directions for future work.
- Research Article
29
- 10.1016/j.media.2024.103279
- Jul 20, 2024
- Medical Image Analysis
Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
- Research Article
19
- 10.1145/3573891
- Mar 24, 2023
- ACM Transactions on Asian and Low-Resource Language Information Processing
In sequence-to-sequence modeling tasks such as image captioning, machine translation, and visual question answering, encoder-decoder architectures are the state of the art. In image captioning, an encoder, a convolutional neural network (CNN), encodes input images into fixed-dimensional vector representations, whereas a decoder, a recurrent neural network, performs language modeling and generates the target descriptions. Recent CNNs apply the same operation to every pixel; however, not all image pixels are equally important. To address this, the proposed method uses a dynamic-convolution-based encoder for image encoding or feature extraction, a Long Short-Term Memory (LSTM) network as a decoder for language modeling, and X-Linear attention to make the system robust. Encoders, attention mechanisms, and decoders are important aspects of the image captioning task; therefore, we experiment with various encoders, decoders, and attention mechanisms. In the existing literature, most image captioning work has been carried out for the English language. We propose a novel approach for caption generation from images in Hindi. Hindi, widely spoken in South Asia and India, is the fourth most-spoken language globally and is India's official language. The proposed method uses dynamic convolution on the encoder side to obtain better image encodings. The Hindi image captioning dataset is manually created by translating the popular MSCOCO dataset from English to Hindi. In terms of BLEU scores, the performance of the proposed method is compared with other baselines, and the results show that the proposed method outperforms them. Manual human assessment of the adequacy and fluency of the generated captions further confirms the efficacy of the proposed method in generating good-quality captions.
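A minimal encoder-decoder captioning skeleton of the kind the paper builds on: a CNN encodes the image and an LSTM decodes the caption. The plain ResNet-18 encoder stands in for the paper's dynamic-convolution encoder, and X-Linear attention is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Captioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = resnet18(weights=None)  # stand-in for the dynamic-convolution encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier
        self.img_proj = nn.Linear(512, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, tokens):
        feat = self.encoder(images).flatten(1)            # (B, 512) image features
        h0 = self.img_proj(feat).unsqueeze(0)             # init LSTM state from the image
        c0 = torch.zeros_like(h0)
        hid, _ = self.lstm(self.embed(tokens), (h0, c0))  # teacher-forced decoding
        return self.out(hid)                              # per-step vocabulary logits
```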
- Research Article
- 10.3389/fcomp.2025.1626346
- Sep 1, 2025
- Frontiers in Computer Science
Multimodal large language models have become the mainstream approach in natural language processing and have been applied to various cross-modal fields such as image description and visual question answering. However, large language models have high computational complexity and a large operational scale, which presents significant challenges for deployment in resource-constrained scenarios. To address these problems, a lightweight multimodal framework, LLaVA-GM, is proposed based on LLaVA; it can be deployed on devices with low resource requirements, greatly reduces model parameters, and achieves good performance on common VQA tasks. The main contributions are as follows. First, the Vicuna language-model backbone in LLaVA is found to be overly redundant: when fine-tuning on downstream tasks, very small datasets can hardly affect such a large language model. It is replaced with the newer Gemma language model, enabling fast task-specific adaptation with fewer parameters and less data. Second, to address information redundancy, a Mixture-of-Experts (MoE) model is introduced and combined with Gemma to reduce computation while maintaining performance. Since directly training the entire model leads to a decline in performance, a multi-stage training strategy is adopted: first, the MLP layer is trained for visual adaptation; then the entire Gemma model is trained to improve multimodal capabilities; and finally, only the MoE layers are trained for sparsification, ensuring a smooth transition from a dense model to a sparse one. Experiments on multiple VQA datasets achieved good performance, confirming the potential of this compact model in downstream multimodal applications.
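A minimal sketch of the MoE idea used for sparsification: each token is routed to one small feed-forward expert, so only a fraction of the parameters is active per token. The top-1 routing and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                  # x: (num_tokens, d_model)
        scores = self.gate(x).softmax(-1)  # routing probabilities
        top = scores.argmax(-1)            # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top == i
            if sel.any():
                # scale by the gate score so routing stays differentiable
                out[sel] = expert(x[sel]) * scores[sel, i:i + 1]
        return out
```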
- Conference Article
42
- 10.18653/v1/2022.findings-acl.196
- Jan 1, 2022
Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual, and -- vice versa -- multilingual models to become multimodal. Our proposed methods outperform current state-of-the-art multilingual multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the accuracy remains low across the board; a performance drop of around 38 accuracy points in target languages showcases the difficulty of zero-shot cross-lingual transfer for this task. Our results suggest that simple cross-lingual transfer of multimodal models yields latent multilingual multimodal misalignment, calling for more sophisticated methods for vision and multilingual language modeling. The xGQA dataset is available online at: this https URL.
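A minimal sketch of the bottleneck-adapter building block that such approaches insert into a frozen transformer to add a language or a modality; the dimensions and placement are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model=768, bottleneck=96):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden):
        # The residual keeps the frozen backbone's behavior recoverable;
        # only the adapter parameters are trained.
        return hidden + self.up(self.act(self.down(hidden)))
```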