Performance of large language models on the Japanese cardiovascular surgery board examination: a comparative analysis of eight contemporary AI models with educational implications.

Abstract

Large language models (LLMs) have shown strong performance on medical licensing and specialty examinations, but their utility in cardiovascular surgery certification and education remains unknown. We evaluated eight LLMs (GPT-5, GPT-4o [OpenAI], Gemini-2.5Pro, Gemini-2.0, Gemma-3 [12B] [Google DeepMind], Claude-4, Claude-3 [Anthropic], and Llama-4 [Scout] [Meta AI]) using their official application programming interfaces as of September 2025. Examination items were obtained from the Japanese Cardiovascular Surgery Board (2021-2024; 523 questions). Text was extracted from PDFs, images were converted to JPEGs, and each question was presented with a standardized Japanese prompt. Models produced three responses per item; final answers were determined by majority voting. Accuracy with 95% confidence intervals was calculated, and pairwise comparisons were performed using McNemar's test. Across 523 items, GPT-5 achieved the highest accuracy (87.4%), followed by Gemini-2.5Pro (85.7%); their performance did not differ significantly. Claude-4 ranked third (80.3%), exceeding the passing threshold in some years. GPT-4o (65.6%) and Gemini-2.0 (58.5%) showed moderate accuracy, whereas Llama-4 (52.2%), Gemma-3 (40.9%), and Claude-3 (36.9%) scored lower. Pairwise testing confirmed a clear stratification: high (GPT-5, Gemini-2.5Pro), upper-intermediate (Claude-4), mid (GPT-4o, Gemini-2.0), and low (others). Accuracy declined for all models on image-based items, with the top models falling to around 70%. Accuracy remained stable across years, with GPT-5, Gemini-2.5Pro, and occasionally Claude-4 surpassing the pass threshold. Successive model generations, particularly GPT-5 and Gemini-2.5Pro, consistently achieved passing-level accuracy. These findings highlight substantial gains through model evolution and underscore the potential of LLMs as supplementary tools for specialty education, despite persistent limitations in image-based reasoning.
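The scoring steps named in the abstract (majority voting over three responses, accuracy with a 95% confidence interval, and McNemar's test for paired comparisons) can be illustrated with a minimal Python sketch. The abstract does not say which interval method, which McNemar variant, or which tie-handling rule the authors used, so the Wilson interval, the exact binomial form of McNemar's test, and the "no majority returns None" rule below are assumptions. All function names are hypothetical.

```python
from collections import Counter
from math import comb, sqrt


def majority_vote(responses):
    """Return the answer given by more than half of the responses.

    Assumption: when no answer has a strict majority (e.g. three different
    answers), return None; the abstract does not describe tie handling.
    """
    answer, count = Counter(responses).most_common(1)[0]
    return answer if count > len(responses) / 2 else None


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (assumed method; z=1.96 ~ 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value from paired per-item correctness flags.

    Only discordant items (one model right, the other wrong) carry information.
    """
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two models agree on every item
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data, not taken from the study.
    key = ["a", "c", "b", "d"]
    model_x = [["a", "a", "b"], ["c", "c", "c"], ["b", "d", "b"], ["a", "b", "c"]]
    model_y = [["b", "b", "b"], ["c", "a", "c"], ["d", "d", "b"], ["d", "d", "a"]]
    x_ok = [majority_vote(r) == k for r, k in zip(model_x, key)]
    y_ok = [majority_vote(r) == k for r, k in zip(model_y, key)]
    print("accuracy X:", sum(x_ok) / len(key), wilson_interval(sum(x_ok), len(key)))
    print("McNemar p:", mcnemar_exact(x_ok, y_ok))
```

Because McNemar's test depends only on the items where the two models disagree, two models with similar overall accuracy can still differ significantly, and vice versa.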

Similar Papers
  • Research Article
  • Citations: 2
  • 10.3352/jeehp.2025.22.36
Performance of large language models in medical licensing examinations: a systematic review and meta-analysis.
  • Nov 18, 2025
  • Journal of educational evaluation for health professions
  • Haniyeh Nouri + 5 more

This study systematically evaluates and compares the performance of large language models (LLMs) in answering medical licensing examination questions. By conducting subgroup analyses based on language, question format, and model type, this meta-analysis aims to provide a comprehensive overview of LLM capabilities in medical education and clinical decision-making. This systematic review, registered in PROSPERO and following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searched MEDLINE (PubMed), Scopus, and Web of Science for relevant articles published up to February 1, 2025. The search strategy included Medical Subject Headings (MeSH) terms and keywords related to ("ChatGPT" OR "GPT" OR "LLM variants") AND ("medical licensing exam*" OR "medical exam*" OR "medical education" OR "radiology exam*"). Eligible studies evaluated LLM accuracy on medical licensing examination questions. Pooled accuracy was estimated using a random-effects model, with subgroup analyses by LLM type, language, and question format. Publication bias was assessed using Egger's regression test. This systematic review identified 2,404 studies. After removing duplicates and excluding irrelevant articles through title and abstract screening, 36 studies were included after full-text review. The pooled accuracy was 72% (95% confidence interval, 70.0% to 75.0%) with high heterogeneity (I²=99%, P<0.001). Among LLMs, GPT-4 achieved the highest accuracy (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%) (P=0.001). Performance differences across languages (range, 62% in Polish to 77% in German) were not statistically significant (P=0.170). LLMs, particularly GPT-4, can match or exceed medical students' examination performance and may serve as supportive educational tools. However, due to variability and the risk of errors, they should be used cautiously as complements rather than replacements for traditional learning methods.

  • Research Article
  • 10.1136/bmjopen-2025-108775
Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
  • May 13, 2026
  • BMJ Open
  • Abdul Rhaman Kafagi + 2 more

Objectives: To assess and compare the performance of four contemporary frontier large language models (LLMs), namely GPT-5.2 (OpenAI), Gemini 3 Pro (Google DeepMind), Claude Sonnet 4.6 (Anthropic) and Grok 4.1 (xAI), on a simulated Fellowship of The Royal College of Surgeons Urology (FRCS (Urol)) Part A examination, evaluating overall accuracy, subspecialty-level performance, output consistency and response time.
Design: Controlled comparative evaluation study using a standardised simulation framework with repeated independent testing runs per model.
Setting: All models were accessed via their respective consumer-facing interfaces. No clinical setting or patient data were involved. Testing was conducted under uniform conditions with conversational memory disabled across all sessions.
Participants: Four large language models were evaluated. No human participants were involved. Models were selected to represent the current frontier of publicly accessible LLMs from four distinct commercial developers. No models were excluded following selection.
Interventions: Each model was presented with 240 FRCS (Urol) Part A single best answer questions, mapped to the Joint Committee on Intercollegiate Examinations' Urology Syllabus Blueprint (2023). A standardised prompt was delivered at the start of each session. Each model completed five independent examination runs. No fine-tuning or system-level modification was applied to any model.
Primary and secondary outcome measures: The primary outcome was overall examination accuracy for each model, benchmarked against an indicative pass threshold for the FRCS (Urol) Part A examination. Secondary outcomes were performance across 18 individual urology subspecialty topics; response time reported as mean total and per-question elapsed time; and consistency of performance quantified by SD and 95% CIs derived from a sequential Monte Carlo sampling procedure. All outcomes were prospectively planned and fully measured as specified.
Results: Three of four models exceeded the indicative 74% pass threshold: Gemini 3 Pro (82.4%±0.9%; 95% CI 81.3 to 83.6%), Claude Sonnet 4.6 (79.3%±1.1%; 95% CI 77.9 to 80.6%) and GPT-5.2 (76.1%±2.4%; 95% CI 73.1 to 79.1%). Grok 4.1 failed (70.4%±0.6%; 95% CI 69.6 to 71.2%), with its entire CI below 74%. All models completed the assessment in under 3 min. Strong performance was observed in research methodology (90–98%) and andrology (92–98%), with the weakest results in paediatric urology (38.7–54.7%) and testicular cancer (48.2–67.3%). Substantial within-model output instability was identified across several domains, most notably GPT-5.2 in female urology (SD±22.8%) and anatomy (SD±14.2%).
Conclusions: Three of four frontier LLMs achieved scores consistent with passing the FRCS (Urol) Part A examination, representing a substantial advance since ChatGPT-3.5. Aggregate accuracy alone, however, obscures important subspecialty weaknesses and output instability. LLMs should be regarded as adjunctive revision aids rather than authoritative knowledge sources and always used alongside expert-led teaching. Future work should evaluate performance on Part B and viva-style assessments.

  • Research Article
  • 10.1186/s13019-026-04251-1
Evaluation of large language models in cardiovascular surgery: a comparative study of board-level clinical question answering and generation.
  • May 7, 2026
  • Journal of cardiothoracic surgery
  • Mehmet Inanc Yesilkaya + 1 more

Large language models (LLMs) are increasingly being explored in surgical training and clinical knowledge assessment. Although these models have demonstrated promising performance in standardized examinations, their performance in highly specialized fields such as cardiovascular surgery remains insufficiently investigated. This study aimed to evaluate the performance of current large language models in answering and generating board-level cardiovascular surgery questions reflecting guideline-based clinical reasoning. In this cross-sectional evaluation study, three large language models (ChatGPT-5.1, Gemini 3, and DeepSeek v3.2) were evaluated in two stages. In the first stage, the models answered 150 multiple-choice questions developed and validated by five cardiovascular surgery specialists using a Delphi process, designed to reflect the content scope and difficulty level of the American Board of Thoracic Surgery certification examination. Accuracy rates and pairwise comparisons were analyzed using the McNemar test. In the second stage, model-generated questions were evaluated by expert cardiovascular surgeons in terms of medical accuracy, clinical relevance, exam-level appropriateness, error type, and difficulty level. Statistical analyses included Spearman correlation, Wilcoxon signed-rank test, and chi-square analysis. The models demonstrated comparable accuracy rates (ChatGPT 80.7%; Gemini 78.7%; DeepSeek 82.0%), with no statistically significant differences between them. Question difficulty level was not associated with model accuracy. Error distribution differed significantly between models (χ² = 8.1; p = 0.02), with Gemini demonstrating the highest rate of valid question generation and DeepSeek showing a higher rate of major errors. A significant positive correlation was observed between model- and expert-assigned difficulty levels. Current large language models demonstrate strong performance in board-level cardiovascular surgery knowledge assessment. 
However, the presence of major errors and variability in difficulty calibration, together with known limitations in clinical reasoning, indicate that these systems should be used cautiously as supportive tools in surgical training and knowledge assessment rather than as substitutes for clinical decision-making.

  • Research Article
  • Citations: 9
  • 10.1001/jamanetworkopen.2025.6359
Semantic Clinical Artificial Intelligence vs Native Large Language Model Performance on the USMLE
  • Apr 22, 2025
  • JAMA Network Open
  • Peter L Elkin + 9 more

Large language models (LLMs) are being implemented in health care. Enhanced accuracy and methods to maintain accuracy over time are needed to maximize LLM benefits. To evaluate whether LLM performance on the US Medical Licensing Examination (USMLE) can be improved by including formally represented semantic clinical knowledge. This comparative effectiveness research study was conducted between June 2024 and February 2025 at the Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, New York, using sample questions from the USMLE Steps 1, 2, and 3. Semantic clinical artificial intelligence (SCAI) was developed to insert formally represented semantic clinical knowledge into LLMs using retrieval augmented generation (RAG). The SCAI method was evaluated by comparing the performance of 3 Llama LLMs (13B, 70B, and 405B; Meta) with and without SCAI RAG on text-based questions from the USMLE Steps 1, 2, and 3. LLM accuracy for answering questions was determined by comparing the LLM output with the USMLE answer key. The LLMs were tested on 87 questions in the USMLE Step 1, 103 in Step 2, and 123 in Step 3. The 13B LLM enhanced by SCAI RAG was associated with significantly improved performance on Steps 1 and 3 but only met the 60% passing threshold on Step 3 (74 questions correct [60.2%]). The 70B and 405B LLMs passed all the USMLE steps with and without SCAI RAG. The SCAI RAG 70B model scored 80 questions (92.0%) correctly on Step 1, 82 (79.6%) on Step 2, and 112 (91.1%) on Step 3. The SCAI RAG 405B model scored 79 (90.8%) correctly on Step 1, 87 (84.5%) on Step 2, and 117 (95.1%) on Step 3. Significant improvements associated with SCAI RAG were found for the 13B model on Steps 1 and 3, the 70B model on Step 2, and the 405B parameter model on Step 3. The 70B model was significantly better than the 13B model, and the 405B model was not significantly better than the 70B model. 
In this comparative effectiveness research study, SCAI RAG was associated with significantly improved scores on the USMLE Steps 1, 2, and 3. The 13B model passed Step 3 with RAG, and the 70B and 405B models passed and scored well on Steps 1, 2, and 3 with or without augmentation. New forms of reasoning by LLMs, like semantic reasoning, have potential to improve the accuracy of LLM performance on important medical questions. Improving LLM performance in health care with targeted, up-to-date clinical knowledge is an important step in LLM implementation and acceptance.

  • Research Article
  • Citations: 117
  • 10.1001/jamanetworkopen.2023.46721
Performance of Large Language Models on a Neurology Board–Style Examination
  • Dec 7, 2023
  • JAMA network open
  • Marc Cicero Schubert + 2 more

Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored. To assess the performance of LLMs on neurology board-style examinations. This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. Overall percentage scores of 2 LLMs. LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers. 
Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.

  • Research Article
  • Citations: 13
  • 10.1148/ryai.240313
Enhancing Large Language Models with Retrieval-Augmented Generation: A Radiology-Specific Approach.
  • May 1, 2025
  • Radiology. Artificial intelligence
  • Dane A Weinert + 1 more

Retrieval-augmented generation (RAG) is a strategy to improve the performance of large language models (LLMs) by providing an LLM with an updated corpus of knowledge that can be used for answer generation in real time. RAG may improve LLM performance and clinical applicability in radiology by providing citable, up-to-date information without requiring model fine-tuning. In this retrospective study, a radiology-specific RAG system was developed using a vector database of 3689 RadioGraphics articles published from January 1999 to December 2023. Performance of five LLMs with (RAG-Systems) and without RAG on a 192-question radiology examination was compared. RAG significantly improved examination scores for GPT-4 (OpenAI; 81.2% vs 75.5%, P = .04) and Command R+ (Cohere; 70.3% vs 62.0%, P = .02), but not for Claude Opus (Anthropic), Mixtral (Mistral AI), or Gemini 1.5 Pro (Google DeepMind). RAG-Systems performed significantly better than pure LLMs on a 24-question subset directly sourced from RadioGraphics (85% vs 76%, P = .03). RAG-Systems retrieved 21 of 24 (87.5%, P < .001) relevant RadioGraphics references cited in the examination's answer explanations and successfully cited them in 18 of 21 (85.7%, P < .001) outputs. The results suggest that RAG is a promising approach to enhance LLM capabilities for radiology knowledge tasks, providing transparent, domain-specific information retrieval.
Keywords: Computer Applications-General (Informatics), Technology Assessment.
Supplemental material is available for this article. © RSNA, 2025. See also the commentary by Mansuri and Gichoya in this issue.

  • Research Article
  • Citations: 3
  • 10.1038/s41598-025-20496-7
Reasoning-based LLMs surpass average human performance on medical social skills
  • Oct 17, 2025
  • Scientific Reports
  • Khalid Ibraheem Alohali + 4 more

A significant portion of medical licensing examinations assesses key social skills such as communication, ethics, and professionalism, which are vital for quality patient care. Artificial intelligence (AI) has been increasingly integrated into healthcare systems in recent years, raising concerns among regulators, providers, and patients regarding AI’s capacity to handle complex, human-centered scenarios. Previous work has shown that large language models (LLMs) like GPT-3.5 and GPT-4 perform well on social skills questions from the United States Medical Licensing Examination (USMLE). However, newer models like GPT-4o, Gemini 1.5 Pro, and o1 have been introduced, with the latter designed to mimic human thinking through “chain of thought” reasoning, unlike other LLMs that provide instantaneous answers. The impact of reasoning on LLMs’ ability to navigate scenarios requiring social skills remains unclear. Here, we evaluate five LLMs (GPT-4, GPT-4o, Gemini 1.5 Pro, o1-preview, and its full version, o1) using forty USMLE-style social skills questions from the UWORLD question bank covering several categories: communication & interpersonal skills, healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. After each LLM answered, it was subjected to an “Are you sure?” follow-up prompt to test consistency. Our results show that o1, the reasoning model, came out on top with 39 out of 40 correct final answers (97.5%). GPT-4o and Gemini 1.5 Pro (87.5%) tied in second place, followed by o1-preview (77.5%) and lastly GPT-4 (75%). All LLMs surpassed the UWORLD question bank’s 64% average. Domain-specific analysis revealed that despite having equal overall scores, GPT-4o and Gemini 1.5 Pro, developed by two different companies, had varying strengths. 
GPT-4o demonstrated its greatest strengths in communication & interpersonal skills and patient safety, while Gemini 1.5 Pro achieved perfect scores in healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. Although o1-preview demonstrated strong initial performance, its inconsistency under skepticism (changing answers frequently, primarily to incorrect ones) reduced its overall ranking from second to fourth. This phenomenon was not observed in any other model, including the final o1 release, which maintained consistent, high-level performance. These findings, along with prior work, highlight the potential of LLMs to answer knowledge-based social skills questions effectively in a medical context, sometimes surpassing average human performance. As LLMs continue to grow in size and sophistication, their performance is expected to improve further. In particular, the strong performance of reasoning-based LLMs suggests that such architectures hold significant promise for advancing AI’s role in socially oriented tasks. These results demonstrate the growing potential for reasoning-based LLMs to complement and enhance clinical training, medical education, and patient care.
Supplementary Information: The online version contains supplementary material available at 10.1038/s41598-025-20496-7.

  • Research Article
  • Citations: 4
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Research Article
  • 10.14423/smj.0000000000001950
Performance of Large Language Models on Diagnostic Radiology Board-Style Questions: A Comparative Evaluation of GPT-4o, Perplexity AI, and OpenEvidence.
  • Apr 3, 2026
  • Southern medical journal
  • Randall Aziz + 5 more

The objective of this study was to compare the diagnostic accuracy and internal consistency of GPT-4o (Generative Pre-Trained Transformer-4 omni), Perplexity AI (artificial intelligence), and OpenEvidence when applied to text-based, specialty-level radiology board questions. A total of 161 text-based multiple-choice questions from the American College of Radiology (ACR) Diagnostic Radiology In-Training Examination were administered across three independent runs for each large language model (LLM). Questions containing images were excluded. All three models were accessed through their respective public Web interfaces. A final answer was assigned to each model based on majority vote across the three runs (two out of three). If all three responses differed, the third (last) response was selected. The selected answer was then compared with the ACR reference key. Internal consistency as well as agreement between each model's final answer and the ACR reference key was assessed using Cohen's kappa. In addition, descriptive statistics were used to analyze performance by radiology subspecialty. SPSS version 30 was used for all statistical analyses, and P<0.05 was considered statistically significant. Perplexity AI demonstrated the highest agreement with the ACR reference key (κ=0.883, P<0.001), followed by OpenEvidence (κ=0.858, P<0.001), and GPT-4o (κ=0.709, P<0.001). All models showed high internal consistency; however, OpenEvidence was the only LLM to demonstrate absolute internal consistency (κ=1.00 for all three runs). Perplexity AI showed the least variability across the 14 radiology subspecialties. Emerging LLMs such as Perplexity AI and OpenEvidence may offer greater diagnostic reliability than general-purpose models in radiology-specific contexts.
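The answer-selection rule described in this abstract (majority vote across three runs, falling back to the last response when all three differ) and the unweighted Cohen's kappa used for agreement can be sketched as follows. This is an illustrative sketch only: the study used SPSS, the function names are hypothetical, and the kappa below is the standard two-rater formula rather than anything confirmed by the abstract beyond its name.

```python
from collections import Counter


def final_answer(runs):
    """Pick the answer given in at least two runs; if all differ, use the last run."""
    answer, count = Counter(runs).most_common(1)[0]
    return answer if count >= 2 else runs[-1]


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal frequency per label.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy data, not taken from the study.
    runs_per_question = [["A", "A", "B"], ["C", "D", "B"], ["D", "D", "D"]]
    answer_key = ["A", "B", "C"]
    selected = [final_answer(runs) for runs in runs_per_question]
    print(selected, cohen_kappa(selected, answer_key))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be well below the raw percentage of matching answers when a few answer options dominate.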

  • Research Article
  • Citations: 39
  • 10.1186/s12909-024-06309-x
Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study
  • Nov 26, 2024
  • BMC Medical Education
  • Yikai Chen + 8 more

Background: This study aimed to evaluate the performance of GPT-3.5, GPT-4, GPT-4o and Google Bard on the United States Medical Licensing Examination (USMLE), the Professional and Linguistic Assessments Board (PLAB), the Hong Kong Medical Licensing Examination (HKMLE) and the National Medical Licensing Examination (NMLE).
Methods: This study was conducted in June 2023. Four large language models (LLMs) (GPT-3.5, GPT-4, GPT-4o and Google Bard) were applied to four standardized medical tests (USMLE, PLAB, HKMLE and NMLE). All questions were multiple-choice questions sourced from the question banks of these examinations.
Results: On USMLE Step 1, Step 2 CK and Step 3, accuracy rates were 91.5%, 94.2% and 92.7% for GPT-4o; 93.2%, 95.0% and 92.0% for GPT-4; 65.6%, 71.6% and 68.5% for GPT-3.5; and 64.3%, 55.6% and 58.1% for Google Bard, respectively. On PLAB, HKMLE and NMLE, GPT-4o scored 93.3%, 91.7% and 84.9%; GPT-4 scored 86.7%, 89.6% and 69.8%; GPT-3.5 scored 80.0%, 68.1% and 60.4%; and Google Bard scored 54.2%, 71.7% and 61.3%. The accuracy rates of the four LLMs differed significantly across the four medical licensing examinations.
Conclusion: GPT-4o performed better in the medical licensing examinations than the other three LLMs. The performance of the four models in the NMLE needs further improvement.
Clinical trial number: Not applicable.

  • Research Article
  • Citations: 5
  • 10.1016/j.acra.2025.01.004
Textual Proficiency and Visual Deficiency: A Comparative Study of Large Language Models and Radiologists in MRI Artifact Detection and Correction.
  • May 1, 2025
  • Academic radiology
  • Yasin Celal Gunes + 7 more


  • Research Article
  • Citations: 1
  • 10.7546/crabs.2026.01.12
The Performance of Large Language Models in Bone Tumour Imaging: Comparative Analysis with Radiologists Using Text and Image-based Evaluation
  • Jan 28, 2026
  • Proceedings of the Bulgarian Academy of Sciences
  • Eren Çamur + 3 more

Large language models (LLMs) are emerging as transformative tools in radiology, with potential to enhance diagnostic workflows. However, their performance in bone tumour imaging, a domain requiring both knowledge-based reasoning and visual interpretation, remains unclear. This study compares the diagnostic performance of LLMs with radiologists across text and image-based tasks. In this cross-sectional study, two LLMs and two radiologists (a junior radiologist, JR, and a senior radiologist, SR) were evaluated using fifty text-based multiple-choice questions (MCQs) and fifty radiographs with clinical vignettes from a public dataset. Participants classified lesions as benign or malignant, identified “don't-touch” lesions, and provided the most likely diagnosis. Responses were benchmarked against a reference standard using McNemar's tests. In MCQs, ChatGPT-5 (92.0%) and Gemini 2.5 Pro (90.0%) achieved accuracies comparable to SR (88.0%) and JR (84.0%) (p > 0.05). For benign-malignant classification, LLMs (50.0%, 48.0%) were similar to JR (66.0%) but inferior to SR (94.0%) (p < 0.05). In identifying “don't-touch” lesions, LLMs (46.0%) matched JR (64.0%) yet underperformed compared to SR (92.0%) (p < 0.05). For specific diagnosis, LLMs showed low accuracy (38.0%, 30.0%) versus JR (60.0%) and SR (86.0%) (p < 0.01). LLMs may serve as useful adjuncts for clinicians and radiologists in text-based tasks and in distinguishing between benign and malignant bone tumours. However, their diagnostic accuracy remains limited.

  • Research Article
  • Citations: 15
  • 10.1308/rcsann.2024.0023
The performance of large language models in intercollegiate Membership of the Royal College of Surgeons examination.
  • Mar 6, 2024
  • Annals of the Royal College of Surgeons of England
  • J Chan + 2 more

Large language models (LLMs), such as Chat Generative Pre-trained Transformer (ChatGPT) and Bard, utilise deep learning algorithms that have been trained on a massive data set of text and code to generate human-like responses. Several studies have demonstrated satisfactory performance on postgraduate examinations, including the United States Medical Licensing Examination. We aimed to evaluate artificial intelligence performance in Part A of the intercollegiate Membership of the Royal College of Surgeons (MRCS) examination. The MRCS mock examination from Pastest, a commonly used question bank for examinees, was used to assess the performance of three LLMs: GPT-3.5, GPT 4.0 and Bard. Three hundred mock questions were input into the three LLMs, and the responses provided by the LLMs were recorded and analysed. The pass mark was set at 70%. The overall accuracies for GPT-3.5, GPT 4.0 and Bard were 67.33%, 71.67% and 65.67%, respectively (p = 0.27). The performances of GPT-3.5, GPT 4.0 and Bard in Applied Basic Sciences were 68.89%, 72.78% and 63.33% (p = 0.15), respectively. Furthermore, the three LLMs obtained correct answers in 65.00%, 70.00% and 69.17% of the Principles of Surgery in General questions (p = 0.67). There were no differences among the three LLMs in overall performance or in either subcategory. Our findings demonstrated satisfactory performance for all three LLMs in the MRCS Part A examination, with GPT 4.0 the only LLM that achieved the pass mark set.

  • Research Article
  • Citations: 18
  • 10.4274/dir.2024.242876
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.
  • Sep 9, 2024
  • Diagnostic and interventional radiology (Ankara, Turkey)
  • Yasin Celal Güneş + 3 more

This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions. This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. The comparison of correct answers and accuracy by question types was analyzed using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests. Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories, except for Claude 3 Opus (P < 0.05). 
Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05). Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses. This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.

  • Abstract
  • Citations: 3
  • 10.1182/blood-2023-185854
Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
  • Nov 2, 2023
  • Blood
  • Ivan Civettini + 14 more

