A Comprehensive Review of Multimodal Large Language Models for Medical Imaging and Omics Data
- Research Article
- 10.18502/fbt.v13i1.20792
- Jan 27, 2026
- Frontiers in Biomedical Technologies
Purpose: This review focuses on how Multimodal Large Language Models (MLLMs) and multimodal AI models are advancing healthcare by integrating medical imaging and omics data. By combining imaging techniques such as MRI, CT, and PET with genomics, transcriptomics, and proteomics, these models offer a comprehensive understanding of diseases, particularly in areas such as cancer diagnosis and treatment. The study also highlights the challenges of managing complex datasets and ensuring effective feature selection. Materials and Methods: We analysed studies leveraging advanced AI models, such as Convolutional Neural Networks (CNNs) and Multimodal Neural Networks (MM-Nets), to integrate diverse data sources. These models enhance medical imaging with omics data to improve disease prediction and management. Applications reviewed include cancer subtype classification, survival outcome prediction, and precision medicine, with a particular focus on non-invasive diagnostic tools. Results: The findings underscore the transformative potential of multimodal models in healthcare: they significantly improve the identification of biomarkers and enable personalized treatment approaches. For instance, models such as VGG19-CNN and PAGE-Net demonstrated higher accuracy in predicting cancer-specific outcomes and integrating genomic and imaging data. Moreover, applications to single-cell analysis and radiomics showcased their ability to uncover molecular-level insights, advancing precision medicine. Conclusion: Multimodal AI represents a breakthrough in healthcare, combining diverse data types to deliver actionable insights for disease management. While challenges such as handling complex datasets and ensuring model transparency remain, ongoing advancements in AI technologies are paving the way for their wider adoption. These models hold immense promise for improving diagnostics, guiding treatment strategies, and enhancing patient outcomes, marking a significant step toward the era of personalized medicine.
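The imaging-plus-omics fusion this review surveys can be illustrated with a minimal sketch. The PyTorch model below is a hedged, generic example of late fusion by concatenation; the layer sizes, the single-channel 2D input, and the two-class head are illustrative assumptions and do not reproduce VGG19-CNN or PAGE-Net.

```python
# Minimal sketch of late fusion: a small CNN imaging branch concatenated with
# an MLP omics branch. Layer sizes and the two-class head are illustrative
# assumptions, not the reviewed VGG19-CNN or PAGE-Net architectures.
import torch
import torch.nn as nn

class ImagingOmicsNet(nn.Module):
    def __init__(self, omics_dim: int, n_classes: int = 2):
        super().__init__()
        # CNN encoder for a single-channel 2D slice (e.g., one MRI/CT slice).
        self.img_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch, 32)
        )
        # MLP encoder for the omics vector (e.g., a gene-expression profile).
        self.omics_encoder = nn.Sequential(
            nn.Linear(omics_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        # Fusion by concatenation, then a classification head.
        self.head = nn.Linear(32 + 32, n_classes)

    def forward(self, image: torch.Tensor, omics: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.img_encoder(image), self.omics_encoder(omics)], dim=1)
        return self.head(z)

# Shape check with random tensors: 4 samples, 64x64 slices, 500 omics features.
model = ImagingOmicsNet(omics_dim=500)
print(model(torch.randn(4, 1, 64, 64), torch.randn(4, 500)).shape)  # torch.Size([4, 2])
```

Concatenation is the simplest fusion strategy; the reviewed models typically use deeper, task-specific branches and heads for survival or subtype prediction.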
- Research Article
- 10.34133/icomputing.0110
- Jan 1, 2025
- Intelligent Computing
Light curves serve as a valuable source of information on stellar formation and evolution. With the rapid advancement of machine learning techniques, they can be effectively processed to extract astronomical patterns and information. In this study, we present a comprehensive evaluation of models based on deep learning and large language models (LLMs) for the automatic classification of variable star light curves, using large datasets from the Kepler and K2 missions. Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision. Employing automated deep learning optimization, we achieve striking performance using 2 architectures: one that combines one-dimensional convolution (Conv1D) with bidirectional long short-term memory (BiLSTM) and another called the Swin Transformer. These achieved accuracies of 94% and 99%, respectively, with the latter demonstrating a notable 83% accuracy in discerning the elusive type II Cepheids that comprise merely 0.02% of the total dataset. We unveil StarWhisper LightCurve (LC), a series of three models based on an LLM, a multimodal large language model (MLLM), and a large audio language model (LALM). Each model is fine-tuned with strategic prompt engineering and customized training methods to explore the emergent abilities of these models for astronomical data. Remarkably, StarWhisper LC series models exhibit high accuracies of around 90%, considerably reducing the need for explicit feature engineering, thereby paving the way for streamlined parallel data processing and the progression of multifaceted multimodal models in astronomical applications. The study furnishes 2 detailed catalogs illustrating the impacts of phase and sampling intervals on deep learning classification accuracy, showing that a substantial decrease of up to 14% in observation duration and 21% in sampling points can be realized without compromising accuracy by more than 10%.
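A minimal sketch of the Conv1D plus BiLSTM idea described above follows; the kernel size, hidden size, sequence length, and four-class head are assumptions for illustration, not the automatically optimized architecture from the study.

```python
# Minimal sketch of a Conv1D + bidirectional LSTM classifier for fixed-length
# light curves; kernel size, hidden size, and the four-class head are
# assumptions, not the automatically optimized architecture from the study.
import torch
import torch.nn as nn

class Conv1DBiLSTM(nn.Module):
    def __init__(self, n_classes: int = 4, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=16, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x.unsqueeze(1))     # (batch, 16, length / 2)
        h = h.transpose(1, 2)             # (batch, length / 2, 16) for the LSTM
        out, _ = self.lstm(h)
        return self.head(out[:, -1, :])   # classify from the final time step

# Shape check: a batch of 8 light curves, each with 512 flux samples.
model = Conv1DBiLSTM()
print(model(torch.randn(8, 512)).shape)  # torch.Size([8, 4])
```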
- Supplementary Content
- 10.1111/jan.16911
- Mar 24, 2025
- Journal of Advanced Nursing
Aims: To explore the potential of multimodal large language models in alleviating the documentation burden on nurses while enhancing the quality and efficiency of patient care. Design: This position paper is informed by expert discussions and a literature review. Methods: We extensively reviewed nursing documentation practices and advanced technologies, such as multimodal large language models. We analysed key challenges, solutions and impacts to propose a futuristic multimodal large language model‐driven model for nursing documentation. Results: Multimodal large language models offer transformative capabilities by integrating multimodal audio, video and text data during patient encounters to dynamically update patient records in real time. This reduces manual data entry, enabling nurses to focus more on direct patient care. These systems also enhance care personalisation through predictive analytics and interoperability, which support seamless workflows and better patient outcomes. While predictive analytics could improve patient care by identifying trends and risk factors from nursing documentation, further research is required to validate its accuracy and clinical utility in real‐world settings. Ethical, legal and practical challenges, including privacy concerns and biases in artificial intelligence models, require careful consideration for successful implementation. Conclusion: Transitioning to multimodal large language model‐driven documentation systems can significantly reduce administrative burdens, improve nurse satisfaction and enhance patient care. However, successful integration demands interdisciplinary collaboration, robust ethical frameworks and technological advancements. Implications for the Profession and Patient Care: Implementing multimodal large language models could alleviate professional burnout, improve nurse–patient interactions, and provide dynamic, up‐to‐date patient records that facilitate informed decision making. These advancements align with the goals of patient‐centred care by enabling more meaningful engagement between nurses and patients. Impact: The problem being addressed is the administrative burden of nursing documentation. We suggest that multimodal large language models minimise manual documentation, enhance patient care quality and significantly impact nurses and patients in diverse healthcare settings globally.
- Research Article
- 10.2196/59505
- Sep 25, 2024
- Journal of Medical Internet Research
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data–driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.
- Research Article
- 10.1007/s00261-024-04708-8
- Dec 2, 2024
- Abdominal radiology (New York)
Large language models (LLMs) and multi-modal large language models (MLLMs) represent the cutting-edge in artificial intelligence. This review provides a comprehensive overview of their capabilities and potential impact on radiology. Unlike most existing literature reviews focusing solely on LLMs, this work examines both LLMs and MLLMs, highlighting their potential to support radiology workflows such as report generation, image interpretation, EHR summarization, differential diagnosis generation, and patient education. By streamlining these tasks, LLMs and MLLMs could reduce radiologist workload, improve diagnostic accuracy, support interdisciplinary collaboration, and ultimately enhance patient care. We also discuss key limitations, such as the limited capacity of current MLLMs to interpret 3D medical images and to integrate information from both image and text data, as well as the lack of effective evaluation methods. Ongoing efforts to address these challenges are introduced.
- Research Article
- 10.1007/s10266-025-01283-2
- Dec 10, 2025
- Odontology
This study aimed to evaluate the diagnostic accuracy of multimodal large language models in classifying superior labial frenulum attachments from intraoral photographs using expert consensus as the reference standard. Five experts (two periodontists and three orthodontists) established the consensus standard by classifying frenulum attachments in 117 intraoral images as mucosal, gingival, papillary, and papilla penetrating. The same photographs were then presented to three multimodal large language models (ChatGPT 4o, Gemini 2.5 Pro, and Microsoft Copilot GPT-4), and their diagnostic performance was evaluated using accuracy, sensitivity, specificity, and F1 score. Reliability was assessed using Fleiss' and Cohen's kappa, and diagnostic performances were compared using Cochran's Q test. Human raters demonstrated almost perfect agreement (κ = 0.838, p < 0.001), whereas large language models showed poor inter-model agreement (κ = -0.124, p < 0.001). ChatGPT achieved slight but statistically significant agreement with the consensus (κ = 0.114, p = 0.019), although its clinical relevance was negligible. Gemini (κ = 0.099) and Copilot (κ = 0.027) showed no significant agreement (p > 0.05). Copilot yielded the highest overall accuracy (46.2%), followed by Gemini (44.5%) and ChatGPT (35.0%). The performance of the large language models varied across frenulum types. Current multimodal large language models demonstrate inconsistent and clinically insufficient accuracy in the classification of superior labial frenulum attachments from photographs. Domain-specific training is essential before large language models can be considered reliable diagnostic tools in dentistry.
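The model-versus-consensus agreement statistics reported above can be computed from paired label lists; the snippet below is a hedged sketch using scikit-learn's cohen_kappa_score on made-up classifications rather than the study data.

```python
# Hedged sketch of model-versus-consensus agreement; the label lists are
# made-up placeholders, not the 117-image study data.
from sklearn.metrics import accuracy_score, cohen_kappa_score

consensus = ["mucosal", "gingival", "papillary", "gingival", "mucosal", "papilla_penetrating"]
model_out = ["mucosal", "papillary", "papillary", "gingival", "gingival", "papilla_penetrating"]

print("accuracy:", accuracy_score(consensus, model_out))
print("Cohen's kappa vs consensus:", cohen_kappa_score(consensus, model_out))
```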
- Research Article
- 10.1007/s00330-024-11339-6
- Jan 15, 2025
- European Radiology
Objective: This study aimed to develop an open-source multimodal large language model (CXR-LLaVA) for interpreting chest X-ray images (CXRs), leveraging recent advances in large language models (LLMs) to potentially replicate the image interpretation skills of human radiologists. Materials and methods: For training, we collected 592,580 publicly available CXRs, of which 374,881 had labels for certain radiographic abnormalities (Dataset 1) and 217,699 provided free-text radiology reports (Dataset 2). After pre-training a vision transformer with Dataset 1, we integrated it with an LLM influenced by the LLaVA network. Then, the model was fine-tuned, primarily using Dataset 2. The model’s diagnostic performance for major pathological findings was evaluated, along with the acceptability of radiologic reports by human radiologists, to gauge its potential for autonomous reporting. Results: The model demonstrated impressive performance in test sets, achieving an average F1 score of 0.81 for six major pathological findings in the MIMIC internal test set and 0.56 for six major pathological findings in the external test set. The model’s F1 scores surpassed those of GPT-4-vision and Gemini-Pro-Vision in both test sets. In human radiologist evaluations of the external test set, the model achieved a 72.7% success rate in autonomous reporting, slightly below the 84.0% rate of ground truth reports. Conclusion: This study highlights the significant potential of multimodal LLMs for CXR interpretation, while also acknowledging the performance limitations. Despite these challenges, we believe that making our model open-source will catalyze further research, expanding its effectiveness and applicability in various clinical contexts. Key Points: Question: How can a multimodal large language model be adapted to interpret chest X-rays and generate radiologic reports? Findings: The developed CXR-LLaVA model effectively detects major pathological findings in chest X-rays and generates radiologic reports with a higher accuracy compared to general-purpose models. Clinical relevance: This study demonstrates the potential of multimodal large language models to support radiologists by autonomously generating chest X-ray reports, potentially reducing diagnostic workloads and improving radiologist efficiency.
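The headline metric above, an average F1 score across six major findings, can be reproduced from binary prediction matrices as sketched below; the finding names and the random arrays are placeholders, not the MIMIC labels or CXR-LLaVA outputs.

```python
# Sketch of averaging per-finding F1 scores over multi-label CXR predictions;
# the finding names and random arrays are placeholders, not MIMIC labels or
# CXR-LLaVA outputs.
import numpy as np
from sklearn.metrics import f1_score

FINDINGS = ["atelectasis", "cardiomegaly", "consolidation",
            "edema", "pleural_effusion", "pneumothorax"]

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, len(FINDINGS)))  # ground-truth labels
y_pred = rng.integers(0, 2, size=(100, len(FINDINGS)))  # model predictions

per_finding = [f1_score(y_true[:, i], y_pred[:, i]) for i in range(len(FINDINGS))]
for name, score in zip(FINDINGS, per_finding):
    print(f"{name}: F1 = {score:.2f}")
print("average F1:", round(float(np.mean(per_finding)), 2))
```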
- Research Article
- 10.1142/s0218001420570050
- Mar 25, 2020
- International Journal of Pattern Recognition and Artificial Intelligence
Hospitals have accumulated a large amount of medical image data that must be analyzed and integrated so that the needed medical image can be found in time, which is the basis of key technologies such as intelligent diagnosis of diseases. Meanwhile, through the analysis and integrated processing of medical images, the potential value of existing medical image data can be fully explored. This paper studies and improves the key technologies in the intelligent image knowledge discovery system in light of the characteristics of medical image data. Considering both the requirements of knowledge discovery and the characteristics of medical image data, RDM texture features are selected as the feature representation of medical images, and an improved RDM operator is proposed. Experimental results show that the improved RDM coding method improves the stability of medical image data expression.
- Research Article
- 10.2196/70863
- Aug 19, 2025
- JMIR Formative Research
Background: Thyroid nodules are common, with ultrasound imaging as the primary modality for their assessment. Risk stratification systems like the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) have been developed but suffer from interobserver variability and low specificity. Artificial intelligence, particularly large language models (LLMs) with multimodal capabilities, presents opportunities for efficient end-to-end diagnostic processes. However, their clinical utility remains uncertain. Objective: This study evaluates the accuracy and consistency of multimodal LLMs for thyroid nodule risk stratification using the ACR TI-RADS system, examining the effects of model fine-tuning, image annotation, prompt engineering, and comparing open-source versus commercial models. Methods: In total, 3 multimodal vision-language models were evaluated: Microsoft’s open-source Large Language and Visual Assistant (LLaVA) model, its medically fine-tuned variant (Large Language and Vision Assistant for bioMedicine [LLaVA-Med]), and OpenAI’s commercial o3 model. A total of 192 thyroid nodules from publicly available ultrasound image datasets were assessed. Each model was evaluated using 2 prompts (basic and modified) and 2 image scenarios (unlabeled vs radiologist-annotated), yielding 6912 responses. Model outputs were compared with expert ratings for accuracy and consistency. Statistical comparisons included Chi-square tests, Mann-Whitney U tests, and Fleiss’ kappa for interrater reliability. Results: Overall, 88.4% (6110/6912) of responses were valid, with the o3 model producing the highest validity rate (2273/2304, 98.6%), followed by LLaVA (2108/2304, 91.5%) and LLaVA-Med (1729/2304, 75%; P<.001). The o3 model demonstrated the highest accuracy overall, achieving up to 57.3% accuracy in Thyroid Imaging Reporting and Data System (TI-RADS) classification, although still remaining suboptimal. Labeled images improved accuracy marginally in nodule margin assessment only when evaluating LLaVA models (407/768, 53% to 447/768, 58.2%; P=.04). Prompt engineering improved accuracy for composition (649/1152, 56.3% vs 483/1152, 41.9%; P<.001), but significantly reduced accuracy for shape, margins, and overall classification. Consistency was the highest with the o3 model (up to 85.4%), but was comparable for LLaVA and significantly improved with image labeling and modified prompts across multiple TI-RADS categories (P<.001). Subgroup analysis for o3 alone showed prompt engineering did not affect accuracy significantly but markedly improved consistency across all TI-RADS categories (up to 97.1% for shape, P<.001). Interrater reliability was consistently poor across all combinations (Fleiss’ kappa<0.60). Conclusions: The study demonstrates the comparative advantages and limitations of multimodal LLMs for thyroid nodule risk stratification. While the commercial model (o3) consistently outperformed open-source models in accuracy and consistency, even the best-performing model outputs remained suboptimal for direct clinical deployment. Prompt engineering significantly enhanced output consistency, particularly in the commercial model. These findings underline the importance of strategic model optimization techniques and highlight areas requiring further development before multimodal LLMs can be reliably used in clinical thyroid imaging workflows.
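The interrater reliability figure reported above relies on Fleiss' kappa across multiple raters; the sketch below shows one way to compute it with statsmodels on a made-up ratings matrix, where rows are nodules and the columns are hypothetical model raters.

```python
# Sketch of Fleiss' kappa for interrater reliability; rows are nodules, columns
# are hypothetical raters (e.g., three model configurations), and the TI-RADS
# levels are made-up placeholders, not the study data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [3, 4, 3],
    [2, 2, 2],
    [5, 4, 5],
    [1, 3, 2],
    [4, 4, 4],
])

table, _ = aggregate_raters(ratings)   # per-nodule counts of each TI-RADS level
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))
```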
- Research Article
- 10.1016/j.media.2024.103279
- Jul 20, 2024
- Medical Image Analysis
Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
- Research Article
- 10.1016/j.spinee.2025.03.011
- Sep 1, 2025
- The spine journal : official journal of the North American Spine Society
GPT4LFS (generative pretrained transformer 4 omni for lumbar foramina stenosis): enhancing lumbar foraminal stenosis image classification through large multimodal models.
- Research Article
- 10.1007/s11604-025-01861-y
- Sep 12, 2025
- Japanese Journal of Radiology
Purpose: To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE). Materials and methods: The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without (text-only). Performance differences between the conditions were assessed using McNemar’s exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) using a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman’s test, followed by pairwise Wilcoxon signed-rank tests with Holm correction. Results: The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters. Conclusion: Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Secondary abstract: Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI’s o3 and Google DeepMind’s Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.
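The paired vision versus text-only comparison described above uses McNemar's exact test; the sketch below shows how such a test can be run with statsmodels on placeholder correctness vectors, not the actual JDRBE responses.

```python
# Sketch of McNemar's exact test on paired conditions (vision vs text-only);
# the correctness vectors are illustrative placeholders, not JDRBE results.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

correct_vision = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
correct_text   = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=bool)

# 2x2 table: rows = vision correct/incorrect, columns = text-only correct/incorrect.
table = [
    [int(np.sum(correct_vision & correct_text)),  int(np.sum(correct_vision & ~correct_text))],
    [int(np.sum(~correct_vision & correct_text)), int(np.sum(~correct_vision & ~correct_text))],
]
result = mcnemar(table, exact=True)
print("McNemar exact p-value:", result.pvalue)
```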
- Research Article
- 10.4274/dir.2026.263696
- Feb 12, 2026
- Diagnostic and interventional radiology (Ankara, Turkey)
Pneumothorax requires rapid recognition and accurate interpretation of chest X-rays (CXRs), particularly in acute settings where delays can have serious consequences. With the emergence of advanced image interpretation models capable of visual analysis, their diagnostic reliability in radiology practice remains to be determined. This study aimed to assess the diagnostic performance of three state-of-the-art systems in detecting pneumothorax using a large, well-annotated dataset. A total of 10,675 CXRs from the publicly available SIIM-ACR Pneumothorax Segmentation dataset were analyzed. Three multimodal models (GPT-4o, Gemini 2 Pro, and Claude 4 Sonnet) were evaluated using a uniform, image-based approach. Each model's binary outputs (presence: 1, absence: 0) were compared with reference results to determine accuracy, sensitivity, specificity, precision, and F1 scores. Additional subgroup analyses were conducted across pneumothorax size categories: small, medium, and large. Pairwise statistical comparisons were performed using McNemar's test. Sensitivity, specificity, and overall accuracy are reported with corresponding 95% confidence intervals. The prevalence of pneumothorax in the dataset was 22.3% (n = 2,379). All models demonstrated high specificity (above 0.90) but consistently low sensitivity (0.16-0.36). The best overall performance was observed with Gemini 2, which achieved an accuracy of 0.79 and specificity of 0.95, whereas Claude 4 showed greater sensitivity (0.20-0.34) across lesion-size categories. Diagnostic performance improved with increasing pneumothorax size, but smaller lesions remained difficult to identify. Pairwise comparisons confirmed statistically significant differences among all evaluated systems (P < 0.050). In this large-scale evaluation, the tested models exhibited strong reliability in identifying normal examinations but limited ability to detect subtle or small pneumothoraxes. Despite high specificity, low sensitivity limits the use of current multimodal large language models as rule-out tools for pneumothorax. With continued refinement, these models may eventually support radiologists by improving workflow efficiency and diagnostic confidence. Automated systems capable of high specificity but low sensitivity should not be relied upon to exclude pneumothorax. However, they may serve as valuable assistants for confirming positive findings and prioritizing urgent cases in busy clinical workflows.
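Sensitivity, specificity, and accuracy with 95% confidence intervals, as reported above, can be derived from binary outputs as in the sketch below; the arrays are random placeholders and the Wilson interval is one reasonable choice, not necessarily the method used in the study.

```python
# Sketch of sensitivity, specificity, and accuracy with Wilson 95% CIs from
# binary outputs; the arrays are random placeholders, not the SIIM-ACR data.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)   # 1 = pneumothorax present
y_pred = rng.integers(0, 2, size=500)   # model's binary output

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

def rate_with_ci(successes: int, total: int):
    low, high = proportion_confint(successes, total, alpha=0.05, method="wilson")
    return successes / total, low, high

print("sensitivity:", rate_with_ci(tp, tp + fn))
print("specificity:", rate_with_ci(tn, tn + fp))
print("accuracy:   ", rate_with_ci(tp + tn, len(y_true)))
```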
- Research Article
- 10.48175/ijarsct-12010
- Jul 4, 2023
- International Journal of Advanced Research in Science, Communication and Technology
In today’s digital era, the demand for digital medical images is rapidly increasing. Hospitals are transitioning to filmless imaging systems, emphasizing the need for efficient storage and seamless transmission of medical images. To meet these requirements, medical image compression becomes essential. However, medical image compression typically necessitates lossless compression techniques to preserve the diagnostic quality and integrity of the images. There are several challenges associated with medical image compression and management. Firstly, medical image management and image data mining involve organizing and accessing large volumes of medical images efficiently for clinical and research purposes. Secondly, bioimaging, which encompasses various imaging modalities like microscopy and molecular imaging, presents specific requirements and challenges for compression algorithms. Thirdly, virtual reality technologies are increasingly utilized in medical visualizations, demanding efficient compression methods to handle the high resolution and immersive nature of VR medical imaging data. Lastly, neuro imaging deals with complex brain imaging data, requiring specialized compression techniques tailored to the unique characteristics of these images. As the amount of medical image data continues to grow, image processing and visualization algorithms have to be adapted to handle the increased workload. Researchers and developers have been working on various compression algorithms to address these challenges and optimize medical image compression. This review paper compares different compression algorithms that would provide valuable insights into the strengths, limitations, and performance metrics of various techniques. It would assist researchers, clinicians, and imaging professionals in selecting the most suitable compression algorithm for their specific needs, considering factors such as compression ratio, computational complexity, and image quality preservation. By comprehensively comparing compression algorithms, this review paper contributes to advancing the field of medical image compression, facilitating efficient image storage, transmission, and analysis in healthcare settings.
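One of the comparison metrics the review names, compression ratio under a lossless codec, can be measured as in the sketch below; zlib on a synthetic 16-bit array stands in for a real medical image codec and dataset.

```python
# Sketch of measuring lossless compression ratio; zlib on a synthetic 16-bit
# array stands in for a real medical image codec and dataset.
import zlib
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 4096, size=(512, 512), dtype=np.uint16)  # synthetic "slice"
raw = image.tobytes()

compressed = zlib.compress(raw, 9)
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.uint16).reshape(image.shape)

print("compression ratio:", round(len(raw) / len(compressed), 2))
print("lossless round-trip:", np.array_equal(image, restored))  # must be True for diagnostic use
```

Real images, which have spatial redundancy, compress far better than this random placeholder; the review compares codecs on exactly this trade-off between compression ratio, computational complexity, and image quality preservation.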
- Research Article
- 10.1182/blood-2025-6131
- Nov 3, 2025
- Blood
AI multimodal large language model on CAR-T pre-leukapheresis evaluation to predict monitoring needs post infusion for early dismissal planning