Medical Visual Question Answering (MedVQA) sits at a crucial intersection of artificial intelligence and healthcare. It enables systems to interpret medical images—such as X-rays, MRIs, and pathology slides—and respond to clinical queries. Early approaches relied primarily on discriminative models, which select answers from predefined candidates. However, these methods struggle to address open-ended, domain-specific, or complex queries. Recent advances have shifted the focus toward generative models, which leverage autoregressive decoders, large language models (LLMs), and multimodal large language models (MLLMs) to produce more nuanced, free-form answers. This review comprehensively examines the paradigm shift from discriminative to generative systems: it surveys generative MedVQA works in terms of their model architectures and training processes, summarizes evaluation benchmarks and metrics, and highlights key advances and techniques that propel the development of generative MedVQA, such as concept alignment, instruction tuning, and parameter-efficient fine-tuning (PEFT), alongside strategies for data augmentation and automated dataset creation. Finally, we propose future directions to enhance clinical reasoning and interpretability, build robust evaluation benchmarks and metrics, and employ scalable training strategies and deployment solutions. By analyzing the strengths and limitations of existing generative MedVQA approaches, we aim to provide valuable insights for researchers and practitioners working in this domain.