Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential

Similar Papers
  • Research Article
  • Citations: 70
  • 10.1016/j.jclinepi.2025.111746
Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review.
  • May 1, 2025
  • Journal of clinical epidemiology
  • Judith-Lisa Lieberum + 8 more

Machine learning promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct attracted attention. We aimed to provide an overview of LLM applications in SR conduct in health research. We systematically searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review that has not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, 1 reviewer extracted data, checked by another. Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We finally included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The most frequently used LLM was Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies were predominant (n = 21, 57%). In half of the studies, authors evaluated LLM use as promising (n = 20, 54%), one-quarter as neutral (n = 9, 24%) and one-fifth as nonpromising (n = 8, 22%). Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.
Systematic reviews are a crucial tool in health research where experts carefully collect and analyze all available evidence on a specific research question. Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and finally found 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in this context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the increasing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.

  • Research Article
  • 10.1177/00202940251344491
Exploring the capabilities of large language models in oral and maxillofacial surgery
  • Jun 26, 2025
  • Measurement and Control
  • Sulaiman Khan + 3 more

Oral and Maxillofacial Surgery (OMFS) is a surgical specialty that serves as a bridge between medicine and dentistry, focusing on the diagnosis and treatment of diseases affecting the mouth, jaw, face, and neck. Large Language Models (LLMs), which first appeared in 2019, are trained on extensive text collections and can process language with high quality. Although OMFS is a hands-on surgical specialty, LLMs have been increasingly used for patient education, research, and training purposes. This study aimed to explore the capabilities of LLMs in the field of OMFS by investigating the most recent literature. Seven peer-reviewed online repositories, including PubMed, Scopus, Association for Computing Machinery (ACM), IEEE, Embase, Cumulative Index to Nursing and Allied Health Literature (CINAHL), and Google Scholar, were searched for relevant articles. Adhering to the PRISMA-ScR guidelines, we conducted a systematic search across these libraries to select articles that incorporated LLMs into OMFS. The forward and backward reference lists of the included articles were checked to retrieve missing articles. After the final screening process, a total of 20 studies were selected for this review. The selected studies demonstrated applications of LLMs in OMFS such as patient education, clinical decision support, and procedural guidance for specific procedures. The study results showed variability in LLM response accuracy and lower accuracy in citation generation, whereas open-ended questions achieved higher accuracy rates. Advanced versions of LLMs, such as ChatGPT4, have shown improved accuracy and reliability compared with older GPT versions, while some studies reported that LLM responses lacked complete details and exhibited only moderate accuracy. This variability in performance emphasizes the need for the continuous refinement of LLMs and highlights the importance of human oversight in clinical applications.
However, there is a need for further refinement, extensive research, and verification by experts.

  • Research Article
  • 10.1007/s10006-026-01514-y
Large language model use in oral and maxillofacial surgery training: a national resident survey.
  • Feb 21, 2026
  • Oral and maxillofacial surgery
  • Nolan Kranc + 7 more

Large language models (LLMs) are advanced artificial intelligence (AI) tools capable of generating human-like text and are increasingly used in education, clinical care, and research. Little is known about their use within oral and maxillofacial surgery (OMFS) training. This study investigates LLM usage trends, perceived value, and educational integration among OMFS residents in the United States. A national, anonymous cross-sectional survey was distributed to OMFS residents via program directors. It gathered demographic data, LLM usage patterns, applications, perceived limitations, and attitudes toward incorporating LLMs into formal education. Eighty-one residents responded; 79.0% (64/81) reported having used an LLM, and of that group, 96.9% (62/64) used ChatGPT. Overall, 51.9% (42/81) of respondents used LLMs at least monthly in residency; however, 97.5% (79/81) reported having received no formal LLM education during residency. Residents used LLMs for clinical decision support, board preparation, research, and career planning. Free-text responses revealed a wide spectrum of views: some advocated for curricular integration and patient education applications, while others questioned the need for formal instruction. LLMs are used frequently by OMFS residents for a variety of purposes. As AI and LLMs become embedded in healthcare, understanding how OMFS residents interact with LLMs is vital. These findings may guide curriculum development, fostering responsible and effective use of LLMs in surgical training and practice.

  • Abstract
  • Citations: 3
  • 10.1182/blood-2023-185854
Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
  • Nov 2, 2023
  • Blood
  • Ivan Civettini + 14 more

  • Research Article
  • Citations: 1
  • 10.1101/2025.02.27.640661
SensitiveCancerGPT: Leveraging Generative Large Language Model on Structured Omics Data to Optimize Drug Sensitivity Prediction.
  • Mar 3, 2025
  • bioRxiv : the preprint server for biology
  • Shaika Chowdhury + 6 more

The fast accumulation of vast pharmacogenomics data of cancer cell lines provides unprecedented opportunities for drug sensitivity prediction (DSP), a crucial prerequisite for the advancement of precision oncology. Recently, Generative Large Language Models (LLM) have demonstrated performance and generalization prowess across diverse tasks in the field of natural language processing (NLP). However, the structured format of the pharmacogenomics data poses a challenge for the utility of LLM in DSP. Therefore, the objective of this study is multi-fold: to adapt prompt engineering for structured pharmacogenomics data toward optimizing LLM's DSP performance, to evaluate LLM's generalization in real-world DSP scenarios, and to compare LLM's DSP performance against that of state-of-the-science baselines. We systematically investigated the capability of the Generative Pre-trained Transformer (GPT) as a DSP model on four publicly available benchmark pharmacogenomics datasets, which are stratified by five cancer tissue types of cell lines and encompass both oncology and non-oncology drugs. Essentially, the predictive landscape of GPT is assessed for effectiveness on the DSP task via four learning paradigms: zero-shot learning, few-shot learning, fine-tuning and clustering pretrained embeddings. To facilitate GPT in seamlessly processing the structured pharmacogenomics data, domain-specific novel prompt engineering is employed by implementing three prompt templates (i.e., Instruction, Instruction-Prefix, Cloze) and integrating pharmacogenomics-related features into the prompt. We validated GPT's performance in diverse real-world DSP scenarios: cross-tissue generalization, blind tests, and analyses of drug-pathway associations and top sensitive/resistant cell lines. Furthermore, we conducted a comparative evaluation of GPT against multiple Transformer-based pretrained models and existing DSP baselines.
Extensive experiments on the pharmacogenomics datasets across the five tissue cohorts demonstrate that fine-tuning GPT yields the best DSP performance (28% F1 increase, p-value = 0.0003) followed by clustering pretrained GPT embeddings (26% F1 increase, p-value = 0.0005), outperforming GPT in-context learning (i.e., few-shot). However, GPT in the zero-shot setting had a large F1 gap, resulting in the worst performance. Within the scope of prompt engineering, performance enhancement was achieved by directly instructing GPT about the DSP task and resorting to a concise context format (i.e., instruction-prefix), leading to an F1 performance gain of 22% (p-value = 0.02), while incorporation of drug-cell line prompt context derived from genomics and/or molecular features further boosted the F1 score by 2%. Compared to state-of-the-science DSP baselines, GPT achieved significantly superior mean F1 performance (16% gain, p-value < 0.05) on the GDSC dataset. In the cross-tissue analysis, GPT showcased generalizability comparable to the within-tissue performances on the GDSC and PRISM datasets, while showing statistically significant F1 performance improvements on the CCLE (8%, p-value = 0.001) and DrugComb (19%, p-value = 0.009) datasets. Evaluation on the challenging blind tests suggests GPT's competitiveness on the CCLE and DrugComb datasets compared to random splitting. Furthermore, analyses of the drug-pathway associations and log probabilities provided valuable insights that align with previous DSP findings. The diverse experiment setups and in-depth analysis underscore the importance of generative LLM, such as GPT, as a viable in silico approach to guide precision oncology. https://github.com/bioIKEA/SensitiveCancerGPT.

  • Abstract
  • 10.1136/jnis-2024-snis.290
E-185 Customized generative pretrained transformer for simplified patient education of carotid angioplasty and stenting: a feasibility study
  • Jul 1, 2024
  • Journal of NeuroInterventional Surgery
  • A Brake + 3 more

Maintaining patient autonomy necessitates a clear understanding of surgical procedures prior to consent. Time constraints, patient literacy, and the complexity of medical terminology pose challenges in conveying this information. Recent...

  • Research Article
  • Citations: 59
  • 10.1016/j.ijom.2023.09.005
The impact and opportunities of large language models like ChatGPT in oral and maxillofacial surgery: a narrative review
  • Oct 3, 2023
  • International Journal of Oral and Maxillofacial Surgery
  • B Puladi + 5 more

Since its release at the end of 2022, the social response to ChatGPT, a large language model (LLM), has been huge, as it has revolutionized the way we communicate with computers. This review was performed to describe the technical background of LLMs and to provide a review of the current literature on LLMs in the field of oral and maxillofacial surgery (OMS). The PubMed, Scopus, and Web of Science databases were searched for LLMs and OMS. Adjacent surgical disciplines were included to cover the entire literature, and records from Google Scholar and medRxiv were added. Out of the 57 records identified, 37 were included; 31 (84%) were related to GPT-3.5, four (11%) to GPT-4, and two (5%) to both. Current research on LLMs is mainly limited to research and scientific writing, patient information/communication, and medical education. Classic OMS diseases are underrepresented. The current literature related to LLMs in OMS has a limited evidence level. There is a need to investigate the use of LLMs scientifically and systematically in the core areas of OMS. Although LLMs are likely to add value outside the operating room, the use of LLMs raises ethical and medical regulatory issues that must first be addressed.

  • Research Article
  • 10.1016/j.jclinepi.2026.112221
Large language models show promising performance for some systematic review tasks but call for cautious implementation: a systematic review.
  • Mar 12, 2026
  • Journal of clinical epidemiology
  • Florian Laignelot + 9 more

With the exponential growth of biomedical literature, the challenge of conducting systematic reviews is becoming increasingly burdensome. We aimed to evaluate the performance of large language models (LLMs) in the automation of some or all steps of systematic reviews and meta-analyses. In this systematic review, we searched PubMed, Embase, the Cochrane Library and preprint platforms up to January 14, 2025. We included any studies assessing the performance of LLMs (eg, generative pre-trained transformer [GPT], Claude, Mistral) in any step of the systematic review process. Pairs of reviewers independently extracted data and assessed risk of bias. We conducted analyses using median (interquartile range [IQR]) for positive (PPA) and negative percent agreements (NPA), respectively, analogous to sensitivity and specificity, between LLMs and human reviewers. From 3889 unique references, we included 63 studies, of which 52 reported performance metrics for a total of 148 LLM performance assessments. Most assessments concerned GPT models (n = 114, 77%). The most frequently evaluated tasks were title and abstract screening (n = 78, 53%), data extraction (n = 23, 16%), and full-text screening (n = 20, 14%). For title and abstract screening, overall median PPA was 0.92 (IQR 0.69-0.98) and median NPA was 0.89 (0.72-0.95). For full-text screening, the overall median PPA was 0.93 (0.87-1.00) and median NPA was 0.92 (0.78-0.97). Late-generation LLMs released after GPT-4 seemed to provide higher performance than earlier models. For other tasks, authors reported overall good performances, but variability of performance metrics precluded complete quantitative synthesis. Global accuracy for data extraction tasks ranged from 0.36 to 1.00, with a median accuracy of 0.95 (IQR 0.91-0.97, n = 11). For the "risk of bias assessment" task, accuracy ranged from 0.44 to 0.90 (median = 0.62, IQR 0.53-0.76, n = 6). The performance of LLMs, particularly newer generations, shows promise in automating some repetitive steps of systematic reviews such as screening. However, their successful integration will require appropriate safeguards and careful implementation.
Systematic reviews are one of the most reliable ways to answer medical and public health questions. They bring together all available studies on a topic and help clinicians and policymakers make informed decisions. However, producing a high-quality systematic review takes a lot of time and effort. Whole teams of researchers spend months screening thousands of articles, extracting data, and double-checking results. With a little more than a million new publications every year, keeping reviews up to date is becoming increasingly difficult. LLMs, such as ChatGPT, may help reduce this workload. These tools can read and summarize text and might assist with repetitive tasks like selecting relevant studies or extracting information from articles. But it is still unclear how reliable these tools are for research purposes. This is the first systematic review to assess LLMs' performance to facilitate systematic reviews. We sought to review all studies that tested LLMs in the different steps of systematic reviews and found 63 studies evaluating how well these tools performed compared with human reviewers. Overall, LLMs showed good agreement with humans for tasks such as screening titles and abstracts, and full-text articles. Newer models seemed to perform better than older ones. However, performance was more variable for complex tasks that require interpretation, such as extracting detailed data or assessing methodological quality. Our findings suggest that LLMs could help researchers work faster and make systematic reviews more efficient. However, they are not ready to replace human judgment. These tools can make mistakes, produce inconsistent results, or generate inaccurate information if not carefully supervised.
In practice, LLMs should be used as assistants rather than substitutes. With proper safeguards, transparent reporting, and human oversight, they may become valuable tools to support evidence-based healthcare and help keep research up to date.
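
The PPA and NPA figures in this abstract can be illustrated with a short sketch. It assumes that each screening decision is a boolean (True = include) and that the human reviewer is the reference, which mirrors the stated analogy to sensitivity and specificity; the function names and sample decisions are hypothetical and are not taken from the review.

```python
from statistics import median, quantiles


def percent_agreement(llm, human):
    """Return (PPA, NPA) of LLM decisions against human reference decisions.

    Decisions are booleans: True = include, False = exclude.
    Assumes the human list contains at least one include and one exclude.
    """
    on_includes = [l for l, h in zip(llm, human) if h]
    on_excludes = [l for l, h in zip(llm, human) if not h]
    # PPA: share of human "include" records that the LLM also included.
    ppa = sum(1 for l in on_includes if l) / len(on_includes)
    # NPA: share of human "exclude" records that the LLM also excluded.
    npa = sum(1 for l in on_excludes if not l) / len(on_excludes)
    return ppa, npa


def median_iqr(values):
    """Summarise per-study values as (median, (Q1, Q3))."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), (q1, q3)


# Hypothetical screening decisions for ten records.
human = [True, True, True, True, False, False, False, False, False, False]
llm = [True, True, True, False, False, False, False, False, True, False]
ppa, npa = percent_agreement(llm, human)
print(f"PPA = {ppa:.2f}, NPA = {npa:.2f}")
```

A review would compute one (PPA, NPA) pair per assessment and then pass the collected values to `median_iqr`; the quartile method used by the authors is not stated, so the "inclusive" method here is only one reasonable choice.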

  • Research Article
  • 10.1164/ajrccm.2025.211.abstracts.a2086
Leveraging Generative Pre-Trained Transformer (GPT) Large Language Models (LLMs) for Interstitial Lung Diseases (ILD) Clinical Research
  • May 1, 2025
  • American Journal of Respiratory and Critical Care Medicine
  • S Chen + 4 more

Rationale: The majority of clinically relevant data is contained in unstructured text such as clinical notes. ILD notes are particularly prone to verbosity and imprecision, making structured data extraction a major bottleneck for clinical research and a costly endeavor when maintaining ILD registries and databases. In this study, we explore the utility and performance characteristics of current GPT LLMs for extracting structured binary data from unstructured clinical text. Methods: We used GPT-3.5-turbo, GPT-4, GPT-4o, GPT-4o-mini and GPT-o1-mini models using a protected health information (PHI) compliant API pipeline and Llama 3.3-70B, Llama 3.1-8B and Mistral-7b-0.3 running locally on GPUs. Patients who were initially seen in the Stanford ILD clinic between 2018 and 2022, with at least three follow-up visits, were selected for this study. Prompt engineering was performed using notes randomly selected from 10 patients of this cohort. Various prompt types (simple, heuristic, and chain of thought) were tested and those with best performance were selected. 100 randomly selected clinic notes from the patient cohort that had not been previously used in the prompt engineering stage were then processed through the LLM pipeline. Three ILD physicians independently scored the same notes and prompts. The discordant answers were reviewed by two ILD physicians to come to a consensus answer, which was considered the ground truth for this study. In addition to answering questions, LLMs can also provide a measure of their confidence in the answer. We used the logprobs parameter to assess the LLM's confidence in its answer, and compared the accuracy for answers in which the LLM was confident versus not. Results: It took approximately 1-2 seconds to process each clinical note-prompt combination. The three ILD physicians’ mean accuracy was 96%, similar to that of GPT-4, GPT-4o, GPT-4o-mini and GPT-o1-mini (Table 1).
The accuracy of GPT 3.5, Llama 3.3-70B, Llama 3.1-8B and Mistral-7b-0.3 was much lower, to the point of not being clinically useful. The LLMs were substantially less accurate for questions where the LLM's level of confidence was low (<80%). Conclusions: GPT LLMs demonstrate human level accuracy while being orders of magnitude faster for extracting structured binary data from unstructured ILD clinical data. Using the LLM's confidence in the answer can further identify records requiring human review, allowing for additional improvement in accuracy. Incorporating these GPT LLMs in ILD clinical research has the potential to dramatically accelerate data-driven insights, streamline research workflows, and elevate clinical research to a new level.
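
The confidence-based review step described in this abstract can be sketched as follows. The sketch assumes that the log-probability returned for the answer token is a natural logarithm, so that exponentiating it gives a probability, and it reuses the 80% cut-off mentioned in the abstract. The record layout, the function names, and the example values are hypothetical; the authors' actual pipeline is not described in enough detail to reproduce.

```python
import math

CONFIDENCE_THRESHOLD = 0.80  # the 80% cut-off mentioned in the abstract


def confidence_from_logprob(logprob):
    """Convert a natural-log token probability into a probability in [0, 1]."""
    return math.exp(logprob)


def triage(answers, threshold=CONFIDENCE_THRESHOLD):
    """Split (note_id, answer, logprob) records into accepted and needs-review lists."""
    accepted, needs_review = [], []
    for note_id, answer, logprob in answers:
        confidence = confidence_from_logprob(logprob)
        record = (note_id, answer, round(confidence, 3))
        if confidence >= threshold:
            accepted.append(record)
        else:
            needs_review.append(record)  # route to a human reviewer
    return accepted, needs_review


# Hypothetical records: (note id, binary answer, log-probability of the answer token).
example = [("note-1", "yes", -0.02), ("note-2", "no", -0.51), ("note-3", "yes", -0.22)]
accepted, needs_review = triage(example)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Only the low-confidence records would then go to physicians, which is the mechanism the conclusions point to for further improving accuracy.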

  • Research Article
  • Citations: 1
  • 10.1093/eurheartj/ehae666.3491
A guideline-informed language model for paediatric cardiology demonstrates high performance in answering complex medical questions
  • Oct 28, 2024
  • European Heart Journal
  • T Uden + 13 more

Background: Paediatric cardiology presents unique challenges with its diverse and complex cases, limited evidence base, and the necessity for multi-expert involvement in decision-making processes. In this context, the introduction of generative pre-trained transformer (GPT) based large language models (LLMs) offers a potential avenue for the provision of complex information and clinical decision support. Purpose: This study evaluates the quality of three different GPT LLMs in answering complex medical questions, including a state-of-the-art preview model that incorporates the German paediatric cardiology guidelines. Methods: Seven paediatric cardiologists and paediatric cardiac surgeons generated 72 questions, including complex questions and medical cases with associated questions. The questions were categorized by difficulty and required knowledge (factual and experience-based or mostly experience-based). We submitted the questions to three LLMs: GPT 3.5, GPT 4 and a GPT 4 turbo preview. The GPT 4 turbo preview was customized by incorporating all guidelines from the German Society for Paediatric Cardiology by a retrieval function. Employing one complex instruction for all questions, we prompted the LLMs to provide precise and detailed expert-level responses. The responses from each model were evaluated by experts based on relevance, factual accuracy, severity of possible harm, completeness, superfluous content, and age-related appropriateness from 0 (very bad) to 7 (very good). Differences were calculated using the Kruskal-Wallis test in SPSS Version 28. Results: Our findings indicated a good performance of all models regarding the dimensions tested. The figures show the average ratings (Figure 1, Figure 2A) and highlight significant differences after Bonferroni correction in bold (Figure 2B). The GPT 4 turbo preview, including the retrieval of guidelines, provided significantly more relevant (average rating [AR] 5.94, meaning mostly relevant), accurate (AR 5.6, meaning between somewhat and mostly accurate) and complete (AR 5, meaning fairly complete) answers compared to GPT 3.5 and GPT 4. In terms of difficulty levels or the type of questions, there was no significant difference in rating. Relevance ratings were slightly better in factual questions (AR 5.7) than in those requiring more experience-based knowledge (AR 5.3). Although GPT 4 had higher average scores compared to GPT 3.5 in all dimensions except superfluous content, the differences in rating were not statistically significant. All models had relevant difficulties considering the age-related aspects of the questions (AR 4.06-4.45, p=0.455). Conclusion: This study highlights the potential and limitations of AI language models in addressing complex medical questions in fields characterized by highly individualized decision-making scenarios. The findings advocate for the development of more specialized AI tools in medicine, tailored to specific medical fields and patient age groups.
[Figure 1: Average ratings of LLMs. Figure 2: Rating differences between LLMs.]

  • Research Article
  • Citations: 124
  • 10.1001/jamanetworkopen.2024.8895
Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department
  • May 7, 2024
  • JAMA Network Open
  • Christopher Y K Williams + 6 more

The introduction of large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4; OpenAI), has generated significant interest in health care, yet studies evaluating their performance in a clinical setting are lacking. Determination of clinical acuity, a measure of a patient's illness severity and level of required medical attention, is one of the foundational elements of medical reasoning in emergency medicine. The objective was to determine whether an LLM can accurately assess clinical acuity in the emergency department (ED). This cross-sectional study identified all adult ED visits from January 1, 2012, to January 17, 2023, at the University of California, San Francisco, with a documented Emergency Severity Index (ESI) acuity level (immediate, emergent, urgent, less urgent, or nonurgent) and with a corresponding ED physician note. A sample of 10 000 pairs of ED visits with nonequivalent ESI scores, balanced for each of the 10 possible pairs of 5 ESI scores, was selected at random. The study examined the potential of the LLM to classify acuity levels of patients in the ED based on the ESI across 10 000 patient pairs. Using deidentified clinical text, the LLM was queried to identify the patient with a higher-acuity presentation within each pair based on the patients' clinical history. An earlier LLM was queried to allow comparison with this model. An accuracy score was calculated to evaluate the performance of both LLMs across the 10 000-pair sample. A 500-pair subsample was manually classified by a physician reviewer to compare performance between the LLMs and human classification. From a total of 251 401 adult ED visits, a balanced sample of 10 000 patient pairs was created wherein each pair comprised patients with disparate ESI acuity scores. Across this sample, the LLM correctly inferred the patient with higher acuity for 8940 of 10 000 pairs (accuracy, 0.89 [95% CI, 0.89-0.90]).
Performance of the comparator LLM (accuracy, 0.84 [95% CI, 0.83-0.84]) was below that of its successor. Among the 500-pair subsample that was also manually classified, LLM performance (accuracy, 0.88 [95% CI, 0.86-0.91]) was comparable with that of the physician reviewer (accuracy, 0.86 [95% CI, 0.83-0.89]). In this cross-sectional study of 10 000 pairs of ED visits, the LLM accurately identified the patient with higher acuity when given pairs of presenting histories extracted from patients' first ED documentation. These findings suggest that the integration of an LLM into ED workflows could enhance triage processes while maintaining triage quality and warrants further investigation.
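
The arithmetic behind the pair design and the headline accuracy can be checked with a short sketch. The 10 possible pairs follow from choosing 2 of the 5 ESI levels. The abstract does not state how the 95% CI was computed, so the normal-approximation (Wald) interval below is an assumption; it happens to reproduce the reported 0.89 (0.89-0.90) after rounding. The function name is hypothetical.

```python
import math
from itertools import combinations

ESI_LEVELS = ["immediate", "emergent", "urgent", "less urgent", "nonurgent"]

# Unordered pairs of distinct acuity levels: C(5, 2) = 10.
level_pairs = list(combinations(ESI_LEVELS, 2))

# A balanced sample of 10 000 pairs gives 1000 pairs per level combination.
pairs_per_combination = 10_000 // len(level_pairs)


def accuracy_with_ci(correct, total, z=1.96):
    """Accuracy with a normal-approximation 95% CI (an assumed method)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


acc, low, high = accuracy_with_ci(8940, 10_000)
print(f"{len(level_pairs)} level pairs, {pairs_per_combination} pairs each")
print(f"accuracy {acc:.2f} (95% CI {low:.2f}-{high:.2f})")
```

Because each pair has exactly one higher-acuity patient, random guessing would score about 0.50, which is the baseline against which the reported accuracies should be read.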

  • Research Article
  • 10.3389/fcell.2025.1574378
Evaluating multiple large language models on orbital diseases
  • Jul 7, 2025
  • Frontiers in Cell and Developmental Biology
  • Qi-Chen Yang + 7 more

The avoidance of mistakes by humans is achieved through continuous learning, error correction, and experience accumulation. This process is known to be both time-consuming and laborious, often involving numerous detours. In order to assist humans in their learning endeavors, ChatGPT (Generative Pre-trained Transformer) has been developed as a collection of large language models (LLMs) capable of generating human-like answers to a wide range of problems. In this study, we sought to assess the potential of LLMs as assistants in addressing queries related to orbital diseases. To accomplish this, we gathered a dataset consisting of 100 orbital questions, along with their corresponding answers, sourced from examinations administered to ophthalmology residents and medical students. Five LLMs were utilized for testing and comparison purposes, namely, GPT-4, GPT-3.5, PaLM2, Claude 2, and SenseNova. Subsequently, the LLM exhibiting the best performance was selected for comparison against ophthalmologists and medical students. Notably, GPT-4 and PaLM2 demonstrated a superior average correlation when compared to the other LLMs. Furthermore, GPT-4 exhibited a broader spectrum of accurate responses and attained the highest average score among all the LLMs. Additionally, GPT-4 demonstrated the highest level of confidence during the test. The performance of GPT-4 surpassed that of medical students, albeit falling short of that exhibited by ophthalmologists. Overall, the findings of the study indicate that GPT-4 exhibited superior performance within the orbital domain of ophthalmology. Given further refinement through training, LLMs possess considerable potential to be utilized as comprehensive instruments alongside medical students and ophthalmologists.

  • Discussion
  • 10.14245/ns.2448236.118
Commentary on “Performance of a Large Language Model in the Generation of Clinical Guidelines for Antibiotic Prophylaxis in Spine Surgery”
  • Mar 1, 2024
  • Neurospine
  • Sun-Ho Lee

The introduction of artificial intelligence (AI), particularly large language models (LLMs) such as the generative pre-trained transformer (GPT) series into the medical field has heralded a new era of data-driven medicine. AI's capacity for processing vast datasets has enabled the development of predictive models that can forecast patient outcomes with remarkable accuracy. LLMs like GPT and its successors have demonstrated an ability to understand and generate human-like text, facilitating their application in medical documentation, patient interaction, and even in generating diagnostic reports from patient data and imaging findings. Over the past 10 years, the development of AI, LLMs, and GPTs has significantly impacted the field of neurosurgery and spinal care as well. [1-5] Zaidat et al. [6] studied the performance of an LLM in the generation of clinical guidelines for antibiotic prophylaxis in spine surgery. This study delves into the capabilities of ChatGPT's models, GPT-3.5 and GPT-4.0, showcasing their potential to streamline medical processes. They suggest that GPT-3.5's ability to generate clinically relevant antibiotic use guidelines for spinal surgery is commendable; however, its limitations, such as the inability to discern the most crucial aspects of the guidelines, redundancy, fabrication of citations, and inconsistency, pose significant barriers to its practical application. GPT-4.0, on the other hand, demonstrates a marked improvement in response accuracy and the ability to cite authoritative guidelines, such as those from the North American Spine Society (NASS). This model's enhanced performance, including a 20% increase in response accuracy and the ability to cite the NASS guideline in over 60% of responses, suggests a more reliable tool for clinicians seeking to integrate AI-generated content into their practice.
However, the study's findings also highlight the inherent unpredictability of LLM responses and the potential for "artificial hallucination," where models generate spurious statements without a solid basis in their training data. This phenomenon raises concerns about the ethical implications of using LLMs in clinical settings, particularly regarding patient care and liability. The possibility of LLMs providing inaccurate responses, especially when prompted for medical advice, necessitates a cautious approach to their deployment. We also pay attention to the limitations of the study itself, including the outdated nature of the NASS guidelines, which have not been updated since 2013, and the potential biases and gaps in the medical knowledge contained within the LLMs' training data. These factors highlight the im...

  • Research Article
  • 10.1016/j.joms.2025.09.016
Evaluating the Effectiveness of Large Language Models in Addressing Patient Queries Regarding Maxillomandibular Fixation for Maxillofacial Fractures.
  • Oct 1, 2025
  • Journal of oral and maxillofacial surgery : official journal of the American Association of Oral and Maxillofacial Surgeons
  • Ragavi Alagarsamy + 6 more

  • Research Article
  • 10.1016/j.bjoms.2025.08.015
Comparison of large language models in oral and maxillofacial surgery.
  • Jan 1, 2026
  • The British journal of oral & maxillofacial surgery
  • Ricardo Grillo + 3 more
