Performance of Large Language Models on a Neurology Board–Style Examination

Abstract

Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs have shown heterogeneous results across specialized medical board examinations, their performance in neurology board examinations remains unexplored. The objective of this study was to assess the performance of LLMs on neurology board-style examinations. This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. The main outcome was the overall percentage score of each of the 2 LLMs. LLM 2 significantly outperformed LLM 1, correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
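
The headline percentages in the abstract follow directly from the reported counts. The short sketch below recomputes them and compares each model with the reported human mean; the variable and function names are illustrative and are not taken from the study.

```python
# Counts and human mean as reported in the abstract above.
TOTAL_QUESTIONS = 1956
CORRECT_ANSWERS = {"LLM 1": 1306, "LLM 2": 1662}
HUMAN_MEAN_PCT = 73.8


def percent_correct(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage, rounded to 1 decimal."""
    return round(100 * correct / total, 1)


# Percentage score per model.
scores = {
    name: percent_correct(correct, TOTAL_QUESTIONS)
    for name, correct in CORRECT_ANSWERS.items()
}

for name, pct in scores.items():
    relation = "above" if pct > HUMAN_MEAN_PCT else "below"
    print(f"{name}: {pct}% ({relation} the human mean of {HUMAN_MEAN_PCT}%)")
```

This only reproduces the descriptive figures; it does not reproduce the significance testing mentioned in the abstract.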

Similar Papers
  • Research Article
  • Citations: 7
  • 10.1177/10225536241268789
Comparitive performance of artificial intelligence-based large language models on the orthopedic in-training examination.
  • Jan 1, 2025
  • Journal of orthopaedic surgery (Hong Kong)
  • Andrew Y Xu + 9 more

Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown. Three LLMs, OpenAI's GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions. GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, p = .498, and p = .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated (p = .003 and p < .001, respectively) and higher-order questions (p < .001). Among the 11 subject categories, all models performed similarly regardless of the subject matter. When individual LLM performance on higher-order questions was assessed, no significant differences were found compared to performance on first-order questions (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when individual model performance was assessed on image-associated questions, only GPT-3.5 performed significantly worse compared to performance on non-image-associated questions (p = .045). The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard.

  • Research Article
  • Citations: 11
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • Citations: 18
  • 10.2196/64284
Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis
  • Jan 16, 2025
  • JMIR Medical Education
  • Boxiong Wei

Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using χ² tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18‐0.60) for Claude, 0.24 (95% CI 0.13‐0.44) for Bard, and 0.25 (95% CI 0.14‐0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=0.02) and had an odds ratio of 0.48 (95% CI 0.27‐0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models’ effectiveness in specialized fields like radiology.
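
The odds ratios quoted in this abstract are consistent with simple unadjusted odds ratios computed from the reported correct-answer counts. The sketch below shows that arithmetic; whether the authors used this exact calculation or a regression model is an assumption, and the function name `odds_ratio` is illustrative.

```python
# Correct-answer counts out of 150 questions, as reported in the abstract.
TOTAL = 150
GPT4_CORRECT = 125
OTHER_MODELS = {
    "Claude": 93,
    "Bard": 82,
    "Tongyi Qianwen": 106,
    "Gemini Pro": 83,
}


def odds_ratio(correct_a: int, correct_b: int, total: int) -> float:
    """Unadjusted odds ratio of a correct answer for model A relative to model B."""
    odds_a = correct_a / (total - correct_a)
    odds_b = correct_b / (total - correct_b)
    return odds_a / odds_b


# Odds of a correct answer for each model relative to GPT-4.
ratios = {
    name: round(odds_ratio(correct, GPT4_CORRECT, TOTAL), 2)
    for name, correct in OTHER_MODELS.items()
}

for name, ratio in ratios.items():
    print(f"{name} vs GPT-4: OR = {ratio}")
```

An odds ratio below 1 means the model had lower odds of answering correctly than GPT-4. The confidence intervals in the abstract are not recomputed here.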

  • Research Article
  • Citations: 4
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Research Article
  • Citations: 5
  • 10.1055/a-2437-2067
Large language models (LLMs) in radiology exams for medical students: Performance and consequences.
  • Nov 4, 2024
  • RoFo : Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin
  • Jennifer Gotta + 18 more

The evolving field of medical education is being shaped by technological advancements, including the integration of Large Language Models (LLMs) like ChatGPT. These models could be invaluable resources for medical students, by simplifying complex concepts and enhancing interactive learning by providing personalized support. LLMs have shown impressive performance in professional examinations, even without specific domain training, making them particularly relevant in the medical field. This study aims to assess the performance of LLMs in radiology examinations for medical students, thereby shedding light on their current capabilities and implications. This study was conducted using 151 multiple-choice questions, which were used for radiology exams for medical students. The questions were categorized by type and topic and were then processed using OpenAI's GPT-3.5 and GPT-4 via their API, or manually put into Perplexity AI with GPT-3.5 and Bing. LLM performance was evaluated overall, by question type and by topic. GPT-3.5 achieved a 67.6% overall accuracy on all 151 questions, while GPT-4 outperformed it significantly with an 88.1% overall accuracy (p<0.001). GPT-4 demonstrated superior performance in both lower-order and higher-order questions compared to GPT-3.5, Perplexity AI, and medical students, with GPT-4 particularly excelling in higher-order questions. All GPT models would have successfully passed the radiology exam for medical students at our university. In conclusion, our study highlights the potential of LLMs as accessible knowledge resources for medical students. GPT-4 performed well on lower-order as well as higher-order questions, making ChatGPT-4 a potentially very useful tool for reviewing radiology exam questions. Radiologists should be aware of ChatGPT's limitations, including its tendency to confidently provide incorrect responses.
· ChatGPT demonstrated remarkable performance, achieving a passing grade on a radiology examination for medical students that did not include image questions.
· GPT-4 exhibits significantly improved performance compared to its predecessors GPT-3.5 and Perplexity AI with 88% of questions answered correctly.
· Radiologists as well as medical students should be aware of ChatGPT's limitations, including its tendency to confidently provide incorrect responses.
Gotta J, Le Hong QA, Koch V et al. Large language models (LLMs) in radiology exams for medical students: Performance and consequences. Rofo 2025; 197: 1057-1067.

  • Abstract
  • Citations: 3
  • 10.1182/blood-2023-185854
Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
  • Nov 2, 2023
  • Blood
  • Ivan Civettini + 14 more


  • Research Article
  • Citations: 67
  • 10.1001/jamanetworkopen.2024.17641
Performance of Large Language Models on Medical Oncology Examination Questions
  • Jun 18, 2024
  • JAMA Network Open
  • Jack B Longwell + 7 more

Large language models (LLMs) have recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information. The objective of this study was to evaluate the accuracy and safety of LLM answers on medical oncology examination questions. This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs. The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm. Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.
In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
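
The abstract reports each accuracy together with a 95% confidence interval but does not say how the intervals were computed. As a rough illustration of how such an interval is obtained from a count, the sketch below uses the Wilson score interval. This is a swapped-in method chosen for its simplicity, so its bounds are close to, but not identical with, the published ones; the function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 125 of 147 questions answered correctly, as reported for proprietary LLM 2.
low, high = wilson_interval(125, 147)
print(f"accuracy = {125 / 147:.1%}, approximate 95% CI = {low:.1%} to {high:.1%}")
```

An exact (Clopper-Pearson) interval would be slightly wider than the Wilson interval for the same counts.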

  • Research Article
  • Citations: 8
  • 10.1186/s12859-025-06081-9
Comparative Assessment of Protein Large Language Models for Enzyme Commission Number Prediction
  • Feb 27, 2025
  • BMC Bioinformatics
  • João Capela + 5 more

Background: Protein large language models (LLM) have been used to extract representations of enzyme sequences to predict their function, which is encoded by enzyme commission (EC) numbers. However, a comprehensive comparison of different LLMs for this task is still lacking, leaving questions about their relative performance. Moreover, protein sequence alignments (e.g. BLASTp or DIAMOND) are often combined with machine learning models to assign EC numbers from homologous enzymes, thus compensating for the shortcomings of these models’ predictions. In this context, LLMs and sequence alignment methods have not been extensively compared as individual predictors, raising unaddressed questions about LLMs’ performance and limitations relative to the alignment methods. In this study, we set out to assess the performance of ESM2, ESM1b, and ProtBERT language models in their ability to predict EC numbers, comparing them with BLASTp, against each other and against models that rely on one-hot encodings of amino acid sequences. Results: Our findings reveal that combining these LLMs with fully connected neural networks surpasses the performance of deep learning models that rely on one-hot encodings. Moreover, although BLASTp provided marginally better results overall, DL models provide results that complement BLASTp’s, revealing that LLMs better predict certain EC numbers while BLASTp excels in predicting others. The ESM2 stood out as the best model among the LLMs tested, providing more accurate predictions on difficult annotation tasks and for enzymes without homologs. Conclusions: Crucially, this study demonstrates that LLMs still have to be improved to become the gold standard tool over BLASTp in mainstream enzyme annotation routines. On the other hand, LLMs can provide good predictions for more difficult-to-annotate enzymes, particularly when the identity between the query sequence and the reference database falls below 25%.
Our results reinforce the claim that BLASTp and LLM models complement each other and can be more effective when used together.

  • Research Article
  • 10.1177/17531934261436305
Performance and reliability of large language models on the European Board of Hand Surgery examination: a multi-model evaluation study.
  • Apr 21, 2026
  • The Journal of hand surgery, European volume
  • Ibrahim Güler + 5 more

Artificial intelligence (AI) has demonstrated transformative potential in medical education and assessment, with large language models achieving competitive results across multiple high-stakes examinations. In this study, we evaluated the performance and inter-run reliability of 10 widely adopted large language models (LLMs) on the European Board of Hand Surgery written examination. Ten LLMs were assessed on the complete 300-item European Board of Hand Surgery written examination using standardized zero-shot prompting. The models included five proprietary systems (GPT-5 Pro, Claude Sonnet 4.5, Gemini 2.5 Pro, Grok-4 and ERNIE 4.5 Turbo) and five open-source architectures (DeepSeek V3.2, Qwen3 Max, Mistral Medium 3.1, Llama 3.3 and Falcon H1). Each LLM completed five independent runs, producing 15000 answers analysed for mean accuracy, 95% confidence intervals and inter-run reliability using Cohen's kappa (κ). Mean accuracy across the LLMs ranged from 72 to 85%, corresponding to total European Board of Hand Surgery scores between 131 and 211 points. Seven of the 10 LLMs reached or exceeded the illustrative pass threshold of 75%, equivalent to 150 of 300 points. Proprietary systems showed consistently higher mean accuracy than open-source systems. The highest-performing LLM (GPT-5 Pro) achieved 85% accuracy with a 95% confidence interval of 84 to 86% and a mean inter-run reliability measured by Cohen's κ of 0.739. The overall reliability across the LLMs was 0.821. Contemporary LLMs show robust and reproducible performance on a complex surgical certification examination, with proprietary architectures tending to outperform open-source counterparts. Although several models reached or exceeded an illustrative pass threshold, persistent gaps in subspecialty knowledge remain such as congenital anomalies and complex reconstructions. 
Therefore, LLMs may assist in structured learning and examination preparation but require specialist oversight and remain unsuitable for independent subspecialty decision-making.
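
Inter-run reliability in this abstract is summarized with Cohen's kappa, which corrects the raw agreement between two sets of answers for the agreement expected by chance. The sketch below computes kappa for one pair of runs; the answer lists are made-up toy data, and how the study aggregated kappa across its five runs per model is not stated in the abstract.

```python
from collections import Counter


def cohens_kappa(run_a: list[str], run_b: list[str]) -> float:
    """Cohen's kappa for two equally long sequences of categorical answers."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    n = len(run_a)
    # Observed agreement: share of items answered identically in both runs.
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    # Chance agreement: product of the marginal answer frequencies, summed over labels.
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1:
        return 1.0  # both runs gave one identical answer throughout
    return (observed - expected) / (1 - expected)


# Hypothetical answers from two runs of the same model on eight items.
first_run = list("ABCDABCD")
second_run = list("ABCDABCA")
kappa = cohens_kappa(first_run, second_run)
print(f"kappa = {kappa:.3f}")
```

Here the runs agree on 7 of 8 items (0.875 observed agreement) against 0.25 expected by chance, which gives a kappa of about 0.833.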

  • Research Article
  • 10.1016/j.jclinepi.2026.112221
Large language models show promising performance for some systematic review tasks but call for cautious implementation: a systematic review.
  • Mar 12, 2026
  • Journal of clinical epidemiology
  • Florian Laignelot + 9 more

With the exponential growth of biomedical literature, the challenge of conducting systematic reviews is becoming increasingly burdensome. We aimed to evaluate the performance of large language models (LLMs) in the automation of some or all steps of systematic reviews and meta-analyses. In this systematic review, we searched PubMed, Embase, the Cochrane Library and preprint platforms up to January 14, 2025. We included any studies assessing the performance of LLMs (eg, generative pre-trained transformer [GPT], Claude, Mistral) in any step of the systematic review process. Pairs of reviewers independently extracted data and assessed risk of bias. We conducted analyses using median (interquartile range [IQR]) for positive (PPA) and negative percent agreements (NPA), respectively, analogous to sensitivity and specificity, between LLMs and human reviewers. From 3889 unique references, we included 63 studies, of which 52 reported performance metrics for a total of 148 LLM performance assessments. Most assessments concerned GPT models (n = 114, 77%). The most frequently evaluated tasks were title and abstract screening (n = 78, 53%), data extraction (n = 23, 16%), and full-text screening (n = 20, 14%). For title and abstract screening, overall median PPA was 0.92 (IQR 0.69-0.98) and median NPA was 0.89 (0.72-0.95). For full-text screening, the overall median PPA was 0.93 (0.87-1.00) and median NPA was 0.92 (0.78-0.97). Late-generation LLMs released after GPT-4 seemed to provide higher performance than earlier models. For other tasks, authors reported overall good performances, but variability of performance metrics precluded complete quantitative synthesis. Global accuracy for data extraction tasks ranged from 0.36 to 1.00, with a median accuracy of 0.95 (IQR 0.91-0.97, n = 11). For the "risk of bias assessment" task, accuracy ranged from 0.44 to 0.90 (median = 0.62, IQR 0.53-0.76, n = 6).
The performance of LLMs, particularly newer generations, shows promise in automating some repetitive steps of systematic reviews such as screening. However, their successful integration will require appropriate safeguards and careful implementation.
Systematic reviews are one of the most reliable ways to answer medical and public health questions. They bring together all available studies on a topic and help clinicians and policymakers make informed decisions. However, producing a high-quality systematic review takes a lot of time and effort. Whole teams of researchers spend months screening thousands of articles, extracting data, and double-checking results. With a little more than a million new publications every year, keeping reviews up to date is becoming increasingly difficult. LLMs, such as ChatGPT, may help reduce this workload. These tools can read and summarize text and might assist with repetitive tasks like selecting relevant studies or extracting information from articles. But it is still unclear how reliable these tools are for research purposes. This is the first systematic review to assess LLMs' performance to facilitate systematic reviews. We sought to review all studies that tested LLMs in the different steps of systematic reviews and found 63 studies evaluating how well these tools performed compared with human reviewers. Overall, LLMs showed good agreement with humans for tasks such as screening titles and abstracts, and full-text articles. Newer models seemed to perform better than older ones. However, performance was more variable for complex tasks that require interpretation, such as extracting detailed data or assessing methodological quality. Our findings suggest that LLMs could help researchers work faster and make systematic reviews more efficient. However, they are not ready to replace human judgment. These tools can make mistakes, produce inconsistent results, or generate inaccurate information if not carefully supervised.
In practice, LLMs should be used as assistants rather than substitutes. With proper safeguards, transparent reporting, and human oversight, they may become valuable tools to support evidence-based healthcare and help keep research up to date.
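
The review summarizes screening performance with positive and negative percent agreement (PPA and NPA), described in the abstract as analogous to sensitivity and specificity with the human reviewers as the comparator. A minimal sketch of that calculation follows; the include/exclude decisions are invented toy data and the function name is illustrative.

```python
def percent_agreements(
    llm_includes: list[bool], human_includes: list[bool]
) -> tuple[float, float]:
    """Return (PPA, NPA) of LLM screening decisions against human decisions."""
    if len(llm_includes) != len(human_includes):
        raise ValueError("decision lists must have equal length")
    pairs = list(zip(llm_includes, human_includes))
    human_yes = sum(1 for _, human in pairs if human)
    human_no = len(pairs) - human_yes
    if human_yes == 0 or human_no == 0:
        raise ValueError("need at least one human include and one human exclude")
    both_yes = sum(1 for llm, human in pairs if llm and human)
    both_no = sum(1 for llm, human in pairs if not llm and not human)
    # PPA: of the records humans included, the share the LLM also included.
    # NPA: of the records humans excluded, the share the LLM also excluded.
    return both_yes / human_yes, both_no / human_no


# Hypothetical title-and-abstract screening decisions for ten records.
human = [True, True, True, True, False, False, False, False, False, False]
llm = [True, True, True, False, False, False, False, False, True, False]
ppa, npa = percent_agreements(llm, human)
print(f"PPA = {ppa:.2f}, NPA = {npa:.2f}")
```

In this toy example the LLM matches 3 of the 4 human inclusions (PPA 0.75) and 5 of the 6 human exclusions (NPA about 0.83).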

  • Research Article
  • 10.1007/s00330-025-12211-x
Predicting molecular types of adult-type diffuse gliomas based on MRI reports with large language models.
  • Dec 22, 2025
  • European radiology
  • Pae Sun Suh + 18 more

To evaluate the performance of large language models (LLMs) in predicting molecular types of adult-type diffuse gliomas according to the 2021 WHO classification using MRI radiology reports. This retrospective study included 2169 patients diagnosed with adult-type diffuse gliomas (294 oligodendrogliomas, 295 IDH-mutant astrocytomas, and 1580 IDH-wildtype glioblastomas) between July 2005 and March 2024 from four hospitals in Asia and Europe. Seven proprietary and open-source LLMs were assessed: GPT-4o-mini, GPT-4.1-mini, Llama 3.1 8B, Llama 3.1 70B, Qwen2.5 7B, Deepseek-r1 8B, and Mistral 7B. The performance of LLMs in classifying molecular types was compared based on the provision of relevant knowledge of glioma imaging findings (knowledge-based vs. naïve prompt). The impact of radiologists' subspecialization in neuro-oncology, report quality, and reporting language on LLMs' performance was also evaluated. LLMs achieved significantly higher (naïve vs. knowledge-based; GPT-4o-mini, 77.0% vs. 79.1%, p < 0.001; Qwen2.5 7B, 75.9% vs. 79.5%, p < 0.001; Deepseek-r1 8B, 66.0% vs. 73.2%, p < 0.001) or comparable accuracy (GPT-4.1-mini, 78.7% vs. 78.6%; Llama 3.1 70B, 78.0% vs. 78.1%; Mistral 7B, 58.4% vs. 57.4%) using knowledge-based prompt compared to naïve prompt, except for Llama 3.1 8B (65.4% vs. 44.6%, p < 0.001). Differences in accuracy were more pronounced in smaller-sized LLMs. Additionally, the accuracy was significantly higher with reports by neuro-oncology specialists and high-quality reports in all LLMs (p < 0.001). LLMs may provide preoperative information on the tumor types of adult-type diffuse gliomas from MRI reports by providing relevant knowledge in the prompt. Informative and descriptive reports could further enhance LLMs' performance.
Question: Our study aimed to evaluate large language models' (LLMs) ability to efficiently predict molecular types of adult-type diffuse gliomas according to the 2021 WHO classification.
Findings: Larger models generally showed better accuracy and were less sensitive to domain-specific knowledge. Their performance improved when using high-quality, longer reports or reports by neuro-oncology specialists.
Clinical relevance: These findings highlight the potential role of LLMs in predicting glioma molecular types, underscoring the importance of informative and descriptive reports in enhancing their performance.

  • Research Article
  • Citations: 4
  • 10.2196/64452
Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers
  • Jul 14, 2025
  • Journal of Medical Internet Research
  • Eden Avnat + 14 more

Background: Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. As large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate nonlanguage evidence-based answers to clinical questions is inherently limited by tokenization. Objective: This study aimed to evaluate LLMs’ performance on two question types: numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs in medical aspects and comparing their performance to humans. Methods: To generate straightforward multichoice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) and created the EBM questions and answers (EBMQAs). EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length), and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). We evaluated the LLM’s accuracy on semantic and numerical question types and according to sublabeled topics. In addition, we examined the question-answering rate of LLMs by enabling them to choose to abstain from responding to questions. For validation, we compared the results for 100 unrelated numerical EBMQA questions between six human medical experts and the two language models. Results: In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numeric accuracy (P<.001).
A median accuracy gap of 7% (IQR 5%‐10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Focusing on Medical Discipline sublabels, Claude 3 performed well in neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.0001), while GPT-4 excelled in cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). Furthermore, humans (82.3%, n=82.3) surpassed both Claude 3 (64.3%, n=64.3; P<.001) and GPT-4 (55.8%, n=55.8; P<.001) in the validation test. Spearman correlation between question-answering and accuracy rate in both Claude 3 and GPT-4 was insignificant (ρ=0.12, P=.69; ρ=0.43, P=.13). Conclusions: Both LLMs excelled more in semantic than numerical Q and As, with Claude 3 surpassing GPT-4 in numerical Q and As. However, both LLMs showed inter- and intramodel gaps in different medical aspects and remained inferior to humans. In addition, their ability to respond or abstain from answering a question does not reliably predict how accurately they perform when they do attempt to answer questions. Thus, their medical advice should be addressed carefully.

  • Research Article
  • Citations: 18
  • 10.4274/dir.2024.242876
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.
  • Sep 9, 2024
  • Diagnostic and interventional radiology (Ankara, Turkey)
  • Yasin Celal Güneş + 3 more

This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions. This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. The comparison of correct answers and accuracy by question types was analyzed using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests. Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories, except Claude 3 Opus (P < 0.05).
Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05). Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses. This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.

  • Research Article
  • Citations: 21
  • 10.1016/j.joms.2024.11.007
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
  • Mar 1, 2025
  • Journal of Oral and Maxillofacial Surgery
  • Reema Mahmoud + 5 more


  • Research Article
  • 10.1186/s12886-026-04767-z
Comparative performance of large language models in answering cornea and cataract surgery questions for resident training.
  • Mar 28, 2026
  • BMC Ophthalmology
  • Seonghwan Kim + 6 more

The application of large language models (LLMs) in the medical field has gained increasing popularity; however, their effectiveness in ophthalmology remains uncertain. This study aimed to evaluate the accuracy of responses generated by various deep learning-based LLMs to questions on cataract and corneal diseases and surgeries, and to verify their educational effectiveness by comparing the performance of LLMs with that of ophthalmology fellows and residents. Eighty-one multiple-choice questions on corneal diseases and cataract surgeries were developed based on the standard format of the Korean ophthalmology board examination and categorized into three subtypes: recall-type (n = 27), interpretation-type (n = 27), and problem-solving-type (n = 27). The accuracy and appropriateness of commonly used LLMs (ChatGPT-4o, ChatGPT-5, Gemini 3.0 Pro, and Claude Sonnet 4.5) were evaluated and compared with one another, and against the performance of three ophthalmology residents and three corneal fellows. Among the four LLMs, ChatGPT-5 demonstrated the highest overall accuracy (75/81; 92.59%), followed by Gemini 3.0 Pro (73/81; 90.12%) and Claude Sonnet 4.5 (73/81; 90.12%), all of which outperformed ChatGPT-4o (70/81; 86.42%) as well as ophthalmology fellows (86.42 ± 1.23%) and residents (82.30 ± 8.22%). For recall-type questions, ChatGPT-5 achieved the highest accuracy (92.59%), and the other three LLMs (85.19% each) outperformed both fellows (82.72 ± 4.28%) and residents (71.60 ± 17.50%). In interpretation-type questions, ChatGPT-5, Gemini 3.0 Pro, and Claude Sonnet 4.5 achieved perfect scores (100%), while ChatGPT-4o (92.59%) performed comparably to fellows (92.59 ± 7.41%) and better than residents (87.65 ± 4.28%). However, in problem-solving-type questions, all LLMs scored lower (ChatGPT-5: 85.19%; others: 81.48%) than the mean performance of fellows (87.65 ± 11.91%). 
Among the LLMs, ChatGPT-5 demonstrated the highest overall performance; however, no statistically significant differences were observed between the models. Compared with ophthalmology trainees, ChatGPT-5, Gemini 3.0 Pro, and Claude Sonnet 4.5 achieved significantly higher overall scores than the mean trainee performance. Although LLMs showed good performance in recall- and interpretation-type questions, their relatively lower accuracy in problem-solving-type questions suggests that further advances in LLMs are necessary for their reliable use as educational tools in ophthalmology.
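
The overall accuracy figures quoted in this abstract follow directly from the reported counts out of 81 questions. As a minimal sketch of that arithmetic (the helper names `accuracy_percent` and `implied_count` are illustrative, not from the paper), the Python snippet below recomputes the percentages. The per-subtype counts are not reported in the abstract; the last step only infers which whole-number count out of 27 is consistent with each quoted percentage.

```python
# Correct-answer counts out of 81 questions, as reported in the abstract.
TOTAL_QUESTIONS = 81
SUBTYPE_QUESTIONS = 27  # each of the three subtypes has n = 27

reported_correct = {
    "ChatGPT-5": 75,
    "Gemini 3.0 Pro": 73,
    "Claude Sonnet 4.5": 73,
    "ChatGPT-4o": 70,
}


def accuracy_percent(n_correct: int, n_total: int) -> float:
    """Return accuracy as a percentage rounded to two decimals."""
    return round(100 * n_correct / n_total, 2)


def implied_count(percent: float, n_total: int) -> int:
    """Infer the whole-number count consistent with a quoted percentage.

    This is an inference for illustration; the abstract does not
    report per-subtype counts.
    """
    return round(percent * n_total / 100)


overall = {
    model: accuracy_percent(count, TOTAL_QUESTIONS)
    for model, count in reported_correct.items()
}

for model, pct in overall.items():
    print(f"{model}: {pct}%")

# Subtype percentages quoted in the abstract, mapped to implied counts of 27.
subtype_counts = {
    pct: implied_count(pct, SUBTYPE_QUESTIONS)
    for pct in (92.59, 85.19, 81.48)
}
print(subtype_counts)
```

Under this reading, a one-question difference on a 27-item subtype shifts accuracy by about 3.7 percentage points, which is worth keeping in mind when comparing the subtype scores.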
