Evaluating the Accuracy and Diagnostic Reasoning of Multimodal Large Language Models in Interpreting Neuroradiology Cases From RadioGraphics.

  • Abstract
  • Literature Map
  • Similar Papers
Abstract
Translate article icon Translate Article Star icon

To evaluate the accuracy and reasoning capabilities of large multimodal language models compared with those of neuroradiology subspecialty-trained radiologists in neuroradiology case interpretation. This experimental study used custom-made 401 radiologic quizzes derived from articles published in RadioGraphics covering neuroradiology and head and neck topics (October 2020 to February 2024). We prompted the GPT-4 Turbo with Vision (GPT-4V), GPT-4 Omni, Gemini Flash, and Claude models to provide the top three differential diagnoses with a rationale and describe examination characteristics such as imaging modality, sequence, use of contrast, image plane, and body part. The temperature was adjusted to 0 and 1 (T1). Two neuroradiologists answered the same questions. The accuracies of the large language models (LLMs) and the neuroradiologists were compared using generalized estimating equations. Three neuroradiologists assessed the rationale provided by the LLMs for their differential diagnoses using four-point scales, separately for specific lesion locations and imaging findings, and evaluated the presence of hallucinations and the overall acceptability of the responses. Top-3 accuracy (i.e., correct answers present among top-3 differential diagnoses) of LLMs ranged from 29.9% (120 of 401) to 49.4% (198 of 401, obtained with GPT-4V in the T1 setting), while radiologists achieved 80.3% (322 of 401) and 68.3% (274 of 401), respectively (P < 0.001). Regarding the rationale for differential diagnoses, GPT-4V (T1) accurately identified both the specific lesion location and imaging findings in 30.7% (123 of 401) and 12.9% (16 of 124) of cases without textual clinical history. Hallucinations occurred in 4.5% (18 of 401), and only 29.4% (118 of 401) of the LLM-generated analyses were deemed acceptable. GPT-4V (T1) demonstrated high accuracy in identifying the imaging modality (97.4% [800 of 821]) and scanned body parts (92.2% [756 of 820]). LLMs remarkably underperformed compared with neuroradiologists and showed unsatisfactory reasoning for their differential diagnoses, with performance declining further in cases without textual input of clinical history. These findings highlight the limitations of current multimodal LLMs in neuroradiological interpretation and their reliance on text input.

Similar Papers
  • Conference Article
  • 10.1145/3711875.3729128
CrossLM: A Data-Free Collaborative Fine-Tuning Framework for Large and Small Language Models
  • Jun 23, 2025
  • Yongheng Deng + 5 more

While large language models (LLMs) are endowed with broad knowledge, their task-specific performance is often suboptimal. Fine-tuning LLMs with task-specific data from diverse nodes is necessary, but this data is typically safeguarded and not shared publicly due to privacy concerns. A common solution involves downstream nodes downloading the LLM locally and fine-tuning it with their proprietary data. However, owners often regard pre-trained LLMs as valuable assets and are reluctant to share them. Additionally, the significant computational resources required by LLMs make local fine-tuning impractical for many nodes. To mitigate these problems, this paper proposes CrossLM, a data-free collaborative fine-tuning framework for large and small language models. CrossLM enables resource-constrained nodes to train smaller language models (SLMs) using their private task-specific data. These SLMs are subsequently leveraged to promote the task-specific natural language generation and understanding capabilities of the LLMs. Simultaneously, the SLMs of nodes also benefit from enhancement by the fine-tuned LLMs. In this way, CrossLM avoids sharing private data and proprietary LLMs, and also reduces the resource requirements of nodes. Through extensive experiments across a range of benchmark tasks and popular language models, we demonstrate that CrossLM significantly boosts the task-specific performance of both LLMs and SLMs while preserving the generalization capabilities of LLMs.

  • Research Article
  • Cite Count Icon 34
  • 10.1148/radiol.241668
Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs.
  • Dec 1, 2024
  • Radiology
  • Pae Sun Suh + 14 more

Background Application of multimodal large language models (LLMs) with both textual and visual capabilities has been steadily increasing, but their ability to interpret radiologic images is still doubted. Purpose To evaluate the accuracy of LLMs and compare it with that of human readers with varying levels of experience and to assess the factors affecting LLM accuracy in answering New England Journal of Medicine Image Challenge cases. Materials and Methods Radiologic images of cases from October 13, 2005, to April 18, 2024, were retrospectively reviewed. Using text and image inputs, LLMs (Open AI's GPT-4 Turbo with Vision [GPT-4V] and GPT-4 Omni [GPT-4o], Google's DeepMind Gemini 1.5 Pro, and Anthropic's Claude 3) provided answers. Human readers (seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student), blinded to the published answers, also answered. LLM accuracy with and without image inputs and short (cases from 2005 to 2015) versus long text inputs (from 2016 to 2024) was evaluated in subgroup analysis to determine the effect of these factors. Factor analysis was assessed using multivariable logistic regression. Accuracy was compared with generalized estimating equations, with multiple comparisons adjusted by using Bonferroni correction. Results A total of 272 cases were included. GPT-4o achieved the highest overall accuracy among LLMs (59.6%; 162 of 272), outperforming a medical student (47.1%; 128 of 272; P < .001) but not junior faculty (80.9%; 220 of 272; P < .001) or the in-training radiologist (70.2%; 191 of 272; P = .003). GPT-4o exhibited similar accuracy regardless of image inputs (without images vs with images, 54.0% [147 of 272] vs 59.6% [162 of 272], respectively; P = .59). Human reader accuracy was unaffected by text length, whereas LLMs demonstrated higher accuracy with long text inputs (all P < .001). Text input length affected LLM accuracy (odds ratio range, 3.2 [95% CI: 1.9, 5.5] to 6.6 [95% CI: 3.7, 12.0]). Conclusion LLMs demonstrated substantial accuracy with text and image inputs, outperforming a medical student. However, their accuracy decreased with shorter text lengths, regardless of image input. © RSNA, 2024 Supplemental material is available for this article.

  • Research Article
  • Cite Count Icon 11
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more

How Can <i>IJDS</i> Authors, Reviewers, and Editors Use (and Misuse) Generative AI?

  • Research Article
  • Cite Count Icon 3
  • 10.1109/access.2024.3419079
Tax Intelligent Decision-Making Language Model
  • Jan 1, 2024
  • IEEE Access
  • Yan Zhong + 2 more

Large language models’ exceptional all-purpose abilities have made human-computer conversations normal, but for particular industries and verticals, they fall short of enhancing the expertise of knowledge and the timeliness of information. In order to give current information, and provide improved search capabilities, large language models need to increasingly incorporate specialist resources and databases. In this research, a model for intelligent assisted decision-making was proposed that the model incorporates knowledge from domain-specific databases and real-time data and uses large language models to offer expert tax guidance. The research proposed to overcome the limits of general-purpose language models and deliver specialized advise for tax-related inquiries by complementing large language models with domain-specific information.The results we achieve demonstrate that by offering tax advice tailored to a given situation, and the model we proposed goes beyond the validity of general large language language models. Our contribution is that not only exploring the combination of tax area and large language model, but also proposing a new effective model for government tax department to use in real life. This study highlights the potential of big language models for use in real-world professional domains and advances the field of domain-specific human-computer interaction.

  • Research Article
  • Cite Count Icon 5
  • 10.34133/icomputing.0110
Deep Learning and Methods Based on Large Language Models Applied to Stellar Light Curve Classification
  • Jan 1, 2025
  • Intelligent Computing
  • Yu-Yang Li + 6 more

Light curves serve as a valuable source of information on stellar formation and evolution. With the rapid advancement of machine learning techniques, they can be effectively processed to extract astronomical patterns and information. In this study, we present a comprehensive evaluation of models based on deep learning and large language models (LLMs) for the automatic classification of variable star light curves, using large datasets from the Kepler and K2 missions. Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision. Employing automated deep learning optimization, we achieve striking performance using 2 architectures: one that combines one-dimensional convolution (Conv1D) with bidirectional long short-term memory (BiLSTM) and another called the Swin Transformer. These achieved accuracies of 94% and 99%, respectively, with the latter demonstrating a notable 83% accuracy in discerning the elusive type II Cepheids that comprise merely 0.02% of the total dataset. We unveil StarWhisper LightCurve (LC), a series of 3 LLM models based on an LLM, a multimodal large language model (MLLM), and a large audio language model (LALM). Each model is fine-tuned with strategic prompt engineering and customized training methods to explore the emergent abilities of these models for astronomical data. Remarkably, StarWhisper LC series models exhibit high accuracies of around 90%, considerably reducing the need for explicit feature engineering, thereby paving the way for streamlined parallel data processing and the progression of multifaceted multimodal models in astronomical applications. The study furnishes 2 detailed catalogs illustrating the impacts of phase and sampling intervals on deep learning classification accuracy, showing that a substantial decrease of up to 14% in observation duration and 21% in sampling points can be realized without compromising accuracy by more than 10%.

  • Discussion
  • Cite Count Icon 2
  • 10.1111/cogs.13430
Large Language Models: A Historical and Sociocultural Perspective.
  • Mar 1, 2024
  • Cognitive Science
  • Eugene Yu Ji

This letter explores the intricate historical and contemporary links between large language models (LLMs) and cognitive science through the lens of information theory, statistical language models, and socioanthropological linguistic theories. The emergence of LLMs highlights the enduring significance of information-based and statistical learning theories in understanding human communication. These theories, initially proposed in the mid-20th century, offered a visionary framework for integrating computational science, social sciences, and humanities, which nonetheless was not fully fulfilled at that time. The subsequent development of sociolinguistics and linguistic anthropology, especially since the 1970s, provided critical perspectives and empirical methods that both challenged and enriched this framework. This letter proposes that two pivotal concepts derived from this development, metapragmatic function and indexicality, offer a fruitful theoretical perspective for integrating the semantic, textual, and pragmatic, contextual dimensions of communication, an amalgamation that contemporary LLMs have yet to fully achieve. The author believes that contemporary cognitive science is at a crucial crossroads, where fostering interdisciplinary dialogues among computational linguistics, social linguistics and linguistic anthropology, and cognitive and social psychology is in particular imperative. Such collaboration is vital to bridge the computational, cognitive, and sociocultural aspects of human communication and human-AI interaction, especially in the era of large language and multimodal models and human-centric Artificial Intelligence (AI).

  • Research Article
  • Cite Count Icon 74
  • 10.1148/radiol.240273
Comparing Diagnostic Accuracy of Radiologists versus GPT-4V and Gemini Pro Vision Using Image Inputs from Diagnosis Please Cases.
  • Jul 1, 2024
  • Radiology
  • Pae Sun Suh + 14 more

Background The diagnostic abilities of multimodal large language models (LLMs) using direct image inputs and the impact of the temperature parameter of LLMs remain unexplored. Purpose To investigate the ability of GPT-4V and Gemini Pro Vision in generating differential diagnoses at different temperatures compared with radiologists using Radiology Diagnosis Please cases. Materials and Methods This retrospective study included Diagnosis Please cases published from January 2008 to October 2023. Input images included original images and captures of the textual patient history and figure legends (without imaging findings) from PDF files of each case. The LLMs were tasked with providing three differential diagnoses, repeated five times at temperatures 0, 0.5, and 1. Eight subspecialty-trained radiologists solved cases. An experienced radiologist compared generated and final diagnoses, considering the result correct if the generated diagnoses included the final diagnosis after five repetitions. Accuracy was assessed across models, temperatures, and radiology subspecialties, with statistical significance set at P < .007 after Bonferroni correction for multiple comparisons across the LLMs at the three temperatures and with radiologists. Results A total of 190 cases were included in neuroradiology (n = 53), multisystem (n = 27), gastrointestinal (n = 25), genitourinary (n = 23), musculoskeletal (n = 17), chest (n = 16), cardiovascular (n = 12), pediatric (n = 12), and breast (n = 5) subspecialties. Overall accuracy improved with increasing temperature settings (0, 0.5, 1) for both GPT-4V (41% [78 of 190 cases], 45% [86 of 190 cases], 49% [93 of 190 cases], respectively) and Gemini Pro Vision (29% [55 of 190 cases], 36% [69 of 190 cases], 39% [74 of 190 cases], respectively), although there was no evidence of a statistically significant difference after Bonferroni adjustment (GPT-4V, P = .12; Gemini Pro Vision, P = .04). The overall accuracy of radiologists (61% [115 of 190 cases]) was higher than that of Gemini Pro Vision at temperature 1 (T1) (P < .001), while no statistically significant difference was observed between radiologists and GPT-4V at T1 after Bonferroni adjustment (P = .02). Radiologists (range, 45%-88%) outperformed the LLMs at T1 (range, 24%-75%) in most subspecialties. Conclusion Using direct radiologic image inputs, GPT-4V and Gemini Pro Vision showed improved diagnostic accuracy with increasing temperature settings. Although GPT-4V slightly underperformed compared with radiologists, it nonetheless demonstrated promising potential as a supportive tool in diagnostic decision-making. © RSNA, 2024 See also the editorial by Nishino and Ballard in this issue.

  • Research Article
  • Cite Count Icon 4
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Supplementary Content
  • 10.1108/ir-02-2025-0074
Large language and vision-language models for robot: safety challenges, mitigation strategies and future directions
  • Jul 29, 2025
  • Industrial Robot: the international journal of robotics research and application
  • Xiangyu Hu + 1 more

Purpose This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications. Design/methodology/approach This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions. Findings The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards. Originality/value This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.

  • Research Article
  • 10.28945/5693
Unlocking the Potential of Large Language Models in Education: Factors Influencing Adoption by Instructional Designers and Academics
  • Jan 1, 2026
  • Journal of Information Technology Education: Research
  • Katherine L Fourie + 2 more

Aim/Purpose: The study investigates the factors influencing the acceptance and utilisation of large language models (LLMs) (predictor variables of LLM usage), such as ChatGPT, in Learning design by instructional designers and university-teaching academics from various countries. Background: Large language models (LLMs) have exploded onto the scene, transforming the landscape of learning design. Instructional designers and university teaching academics have been overburdened with content creation for their teaching programmes, and the arrival of LLM models will help in this regard by developing more interactive content that drives student engagement and, in turn, contributes to student success. Since LLMs are a relatively new phenomenon, little is known about the factors influencing their acceptance in learning design; therefore, this research is needed, as learning design principles are the bedrock of student engagement and success. Methodology: A cross-sectional correlational quantitative study was employed. Data was collected using an online questionnaire posted on social media, including LinkedIn, from 203 instructional designers and university teaching academics. Purposive and snowball sampling methods were used to target instructional designers and university teaching academics at colleges and universities worldwide. Participants were asked to share the survey link with fellow instructional designers and university-teaching academics in their communities. The factor structure of the data was determined using exploratory factor analysis. Nonetheless, the factor structure derived from the LLMs did not entirely reflect the original configuration of the Unified Theory of Acceptance and Use of Technology (UTAUT3), as certain predictors appeared to coalesce, indicating LLMs’ unique nature in learning design. Confirmatory factor analysis was used to verify the fit of the data on the measurement model. First-order and second-order structural modelling were used to identify the structural relationships among the variables. Contribution: The study determines significant factors for the acceptance of LLMs by instructional designers and academic teaching staff in learning design, enabling possible opportunities for best practices in the field through interventions to optimize LLM usage. The study applies the technology acceptance model to the emerging LLM technology and extends the technology acceptance model by adding the trust construct as a predictor variable. Findings: The structural analysis results indicated that the ingrained LLM practices, LLM peer-driven expectations, innovative propensity towards LLM adoption, reliability and provider trust in LLMs, and ease of use and support influenced perceived LLM benefits and usage, but community standards and infrastructure had no influence. The second-order structural equation modelling indicated that perceived LLM benefits and usage and ingrained LLM habits contributed most to the learning design. Recommendations for Practitioners: Teaching academics and instructional designers must use LLMs in designing content, assessments, and interactive learning activities, and attend LLM training workshops on prompting and best practices in integrating LLMs into learning and teaching to see their benefits; hence, regular use of LLMs will then lead to trust and innovation in LLMs usage, enhancing learning design and improving student learning outcomes. Recommendation for Researchers: Researchers must use mixed methods approaches to have a deeper understanding of the factors influencing LLMs. Since habit and perceived LLM benefits and usage contributed the most variance to learning design, researchers must investigate strategies that optimise these factors in learning design, such as effective intervention strategies that can help form positive LLM habits. In addition, the findings provide researchers with a starting point for future research. Further researchers must investigate interventions that optimise the influence of personal innovativeness and trust that contributed the least variance to learning design, hence unlocking the potential of LLMs in learning design through innovation, responsible, and ethical use. Impact on Society: The use of LLMs in learning design has a high possibility of transforming education, specifically the learning design landscape. Using LLMs will free up more time for teaching academics and instructional designers so that they spend more time on higher-order thinking skill demands. Consequently, the students will be exposed to more engaging and interactive content, resulting in improved learning outcomes. Future Research: Future research must include context-derived external variables in technology acceptance models, such as levels of prompting competencies, to provide a deeper understanding of LLMs. In addition, future research must be based on the application and impact of LLMs on student engagement and success, and their attainment of 21st-century skills.

  • Research Article
  • Cite Count Icon 1
  • 10.1080/13658816.2025.2577252
Extraction of geoprocessing modeling knowledge from crowdsourced Google Earth Engine scripts by coordinating large and small language models
  • Nov 1, 2025
  • International Journal of Geographical Information Science
  • Anqi Zhao + 7 more

The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).

  • Research Article
  • Cite Count Icon 1
  • 10.1016/j.tvjl.2025.106476
Reading systemic disease from the feline fundus: A multicenter comparison of multimodal large language models and veterinary clinicians.
  • Dec 1, 2025
  • Veterinary journal (London, England : 1997)
  • Emre Eren + 6 more

Reading systemic disease from the feline fundus: A multicenter comparison of multimodal large language models and veterinary clinicians.

  • PDF Download Icon
  • Preprint Article
  • 10.2196/preprints.68320
Knowledge Enhancement of Small-Scale Models in Medical Question Answering (Preprint)
  • Nov 3, 2024
  • Xinbai Li + 3 more

BACKGROUND Medical question answering (QA) is essential for various medical applications. While small-scale pre-training language models (PLMs) are widely adopted in open-domain QA tasks through fine-tuning with related datasets, applying this approach in the medical domain requires significant and rigorous integration of external knowledge. Knowledge-enhanced small-scale PLMs have been proposed to incorporate knowledge bases (KBs) to improve performance, as KBs contain vast amounts of factual knowledge. Large language models (LLMs) contain a vast amount of knowledge and have attracted significant research interest due to their outstanding natural language processing (NLP) capabilities. KBs and LLMs can provide external knowledge to enhance small-scale models in medical QA. OBJECTIVE KBs consist of structured factual knowledge that must be converted into sentences to align with the input format of PLMs. However, these converted sentences often lack semantic coherence, potentially causing them to deviate from the intrinsic knowledge of KBs. LLMs, on the other hand, can generate natural, semantically rich sentences, but they may also produce irrelevant or inaccurate statements. Retrieval-augmented generation (RAG) paradigm enhances LLMs by retrieving relevant information from an external database before responding. By integrating LLMs and KBs using the RAG paradigm, it is possible to generate statements that combine the factual knowledge of KBs with the semantic richness of LLMs, thereby enhancing the performance of small-scale models. In this paper, we explore a RAG fine-tuning method, RAG-mQA, that combines KBs and LLMs to improve small-scale models in medical QA. METHODS In the RAG fine-tuning scenario, we adopt medical KBs as an external database to augment the text generation of LLMs, producing statements that integrate medical domain knowledge with semantic knowledge. Specifically, KBs are used to extract medical concepts from the input text, while LLMs are tasked with generating statements based on these extracted concepts. In addition, we introduce two strategies for constructing knowledge: KB-based and LLM-based construction. In the KB-based scenario, we extract medical concepts from the input text using KBs and convert them into sentences by connecting the concepts sequentially. In the LLM-based scenario, we provide the input text to an LLM, which generates relevant statements to answer the question. For downstream QA tasks, the knowledge produced by these three strategies is inserted into the input text to fine-tune a small-scale PLM. F1 and exact match (EM) scores are employed as evaluation metrics for performance comparison. Fine-tuned PLMs without knowledge insertion serve as baselines. Experiments are conducted on two medical QA datasets: emrQA (English) and MedicalQA (Chinese). RESULTS RAG-mQA achieved the best results on both datasets. On the MedicalQA dataset, compared to the KB-based and LLM-based enhancement methods, RAG-mQA improved the F1 score by 0.59% and 2.36%, and the EM score by 2.96% and 11.18%, respectively. On the emrQA dataset, the EM score of RAG-mQA exceeded those of the KB-based and LLM-based methods by 4.65% and 7.01%, respectively. CONCLUSIONS Experimental results demonstrate that RAG fine-tuning method can improve the model performance in medical QA. RAG-mQA achieves greater improvements compared to other knowledge-enhanced methods. CLINICALTRIAL This study does not involve trial registration.

  • Research Article
  • 10.64898/2026.02.03.26345402
Evaluating Diagnostic Accuracy and Clinical Reasoning of Multiple Large Language Models in Psychiatry.
  • Feb 11, 2026
  • medRxiv : the preprint server for health sciences
  • Kevin W Jin + 20 more

Existing large language model (LLM) evaluations rely on accuracy benchmarks that fail to capture whether models reason well while making diagnoses. Studies that do analyse reasoning focus on post hoc explanations accompanying model outputs rather than distinct, clinician-visible artifacts such as detailed reasoning traces. This creates a translational gap in domains such as psychiatry, where diagnosis relies on narrative interpretation, diagnostic reasoning, and clinical judgment under uncertainty. We conducted a mixed-methods evaluation of four state-of-the-art LLMs using a clinician-curated dataset of 196 psychiatric case vignettes, including 135 published cases and 61 novel clinician-authored vignettes. Diagnostic accuracy was assessed using multiple metrics (top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank) based on ranked lists of five differential diagnoses per vignette. Clinical reasoning quality was evaluated on a randomly selected subset of 30 vignettes through clinician assessment of model-generated diagnostic reasoning traces alongside qualitative commentary from board-certified psychiatrists. We examined the association between clinician-rated reasoning quality and diagnostic correctness and included an illustrative comparison with psychiatry residents. Clinician-rated diagnostic reasoning quality was strongly associated with diagnostic correctness in mixed-effects logistic regression analyses (β = 1·80; p < 0·001), whereas data extraction quality alone was not. Across the full vignette set, models demonstrated moderate to high diagnostic accuracy. The highest-performing model achieved a top-5 accuracy of 0·801 and also received the highest clinician-rated reasoning scores. In an illustrative comparison, model diagnostic accuracy fell within the range observed for psychiatry residents. Diagnostic reasoning quality captures clinically meaningful variation in LLM performance beyond accuracy metrics. Psychiatry may represent a stringent testbed for evaluating reasoning in narrative-driven clinical domains. Evaluation frameworks for LLM-based clinical decision support should incorporate structured assessment of reasoning processes, not accuracy alone. Evidence before this study: We searched PubMed and Scopus for studies evaluating large language models for psychiatric diagnosis and/or differential diagnosis from text vignettes. Searches were run from database inception to February 6, 2026, using terms including ("large language model" OR LLM "artificial intelligence" OR "generative AI" OR AI OR ChatGPT OR GPT OR Claude OR Gemini OR DeepSeek OR Llama) AND (psychiatr* OR mental OR DSM OR "differential diagnosis" OR diagnos*) AND (vignette OR case OR "case report"). We included empirical studies that evaluated model diagnostic outputs using psychiatric cases/vignettes and excluded editorials/commentaries and studies that did not report case-level diagnostic performance. We did not formally assess risk of bias/quality; studies were heterogeneous in vignette sources, model access, and outcome definitions, precluding quantitative pooling. Prior studies have typically examined small vignette sets, focused on narrow diagnostic domains, evaluated single models, or relied primarily on outcome-based accuracy metrics. When diagnostic reasoning has been assessed, it has usually been inferred from post hoc explanations accompanying model outputs rather than evaluated as a distinct, clinician-visible artifact. Clinician-grounded evaluations of diagnostic reasoning across multiple contemporary models remain limited.Added value of this study: This study provides a large-scale, clinician-grounded evaluation of diagnostic accuracy and diagnostic reasoning quality across four contemporary large language models using a diverse dataset of psychiatric case vignettes. Rather than relying solely on outcome-based explanations, we directly evaluated model-generated diagnostic reasoning traces as clinician-visible artifacts using structured clinician ratings and qualitative analysis. By integrating multiple accuracy metrics with clinician assessment of reasoning coherence, flexibility, and plausibility, we demonstrate that clinician-rated reasoning quality is strongly associated with diagnostic correctness, whereas data extraction quality alone is not. Our analysis also identifies recurrent reasoning failure modes not captured by accuracy metrics, highlighting psychiatry as a stringent testbed for evaluating reasoning in narrative-driven clinical domains.Implications of all the available evidence: Evaluations of large language models for clinical decision support should extend beyond accuracy to include systematic assessment of clinician-visible diagnostic reasoning. Mixed-methods, clinician-grounded evaluation frameworks that examine both diagnostic outcomes and reasoning artifacts may be critical for responsible assessment of LLMs in psychiatry and other areas of medicine where diagnosis depends on interpretation, judgment, and tolerance of uncertainty.

  • Conference Article
  • Cite Count Icon 135
  • 10.1145/3510003.3510203
Jigsaw
  • May 21, 2022
  • Naman Jain + 6 more

Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool Jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.

Save Icon
Up Arrow
Open/Close
Notes

Save Important notes in documents

Highlight text to save as a note, or write notes directly

You can also access these Documents in Paperpal, our AI writing tool

Powered by our AI Writing Assistant