
Commentary on “Performance of a Large Language Model in the Generation of Clinical Guidelines for Antibiotic Prophylaxis in Spine Surgery”

Abstract

The introduction of artificial intelligence (AI), particularly large language models (LLMs) such as the generative pre-trained transformer (GPT) series, into the medical field has heralded a new era of data-driven medicine. AI's capacity for processing vast datasets has enabled the development of predictive models that can forecast patient outcomes with remarkable accuracy. LLMs like GPT and its successors have demonstrated an ability to understand and generate human-like text, facilitating their application in medical documentation, patient interaction, and even the generation of diagnostic reports from patient data and imaging findings. Over the past 10 years, the development of AI, LLMs, and GPTs has also significantly affected neurosurgery and spinal care [1-5]. Zaidat et al. [6] studied the performance of an LLM in the generation of clinical guidelines for antibiotic prophylaxis in spine surgery. Their study examines the capabilities of ChatGPT's models, GPT-3.5 and GPT-4.0, and showcases their potential to streamline medical processes. The authors suggest that GPT-3.5's ability to generate clinically relevant antibiotic use guidelines for spinal surgery is commendable; however, its limitations, such as the inability to discern the most crucial aspects of the guidelines, redundancy, fabrication of citations, and inconsistency, pose significant barriers to its practical application. GPT-4.0, on the other hand, demonstrates a marked improvement in response accuracy and the ability to cite authoritative guidelines, such as those from the North American Spine Society (NASS). This model's enhanced performance, including a 20% increase in response accuracy and the ability to cite the NASS guideline in over 60% of responses, suggests a more reliable tool for clinicians seeking to integrate AI-generated content into their practice.
However, the study's findings also highlight the inherent unpredictability of LLM responses and the potential for "artificial hallucination," where models generate spurious statements without a solid basis in their training data. This phenomenon raises concerns about the ethical implications of using LLMs in clinical settings, particularly regarding patient care and liability. The possibility of LLMs providing inaccurate responses, especially when prompted for medical advice, necessitates a cautious approach to their deployment. We also note the limitations of the study itself, including the outdated nature of the NASS guidelines, which have not been updated since 2013, and the potential biases and gaps in the medical knowledge contained within the LLMs' training data. These factors highlight the im-
Neurospine

Similar Papers
  • Research Article
  • Citations: 28
  • 10.14245/ns.2347310.655
Performance of a Large Language Model in the Generation of Clinical Guidelines for Antibiotic Prophylaxis in Spine Surgery
  • Mar 1, 2024
  • Neurospine
  • Bashar Zaidat + 10 more

Objective: Large language models, such as chat generative pre-trained transformer (ChatGPT), have great potential for streamlining medical processes and assisting physicians in clinical decision-making. This study aimed to assess the potential of ChatGPT’s 2 models (GPT-3.5 and GPT-4.0) to support clinical decision-making by comparing their responses for antibiotic prophylaxis in spine surgery to accepted clinical guidelines.
Methods: ChatGPT models were prompted with questions from the North American Spine Society (NASS) Evidence-based Clinical Guidelines for Multidisciplinary Spine Care for Antibiotic Prophylaxis in Spine Surgery (2013). Their responses were then compared and assessed for accuracy.
Results: Of the 16 NASS guideline questions concerning antibiotic prophylaxis, 10 responses (62.5%) were accurate in ChatGPT’s GPT-3.5 model and 13 (81%) were accurate in GPT-4.0. Twenty-five percent of GPT-3.5 answers were deemed overly confident, while 62.5% of GPT-4.0 answers directly used the NASS guideline as evidence for their response.
Conclusion: ChatGPT demonstrated an impressive ability to accurately answer clinical questions. The GPT-3.5 model’s performance was limited by its tendency to give overly confident responses and its inability to identify the most significant elements in its responses. The GPT-4.0 model’s responses had higher accuracy and cited the NASS guideline as direct evidence many times. While GPT-4.0 is still far from perfect, it has shown an exceptional ability to extract the most relevant research available compared to GPT-3.5. Thus, while ChatGPT has shown far-reaching potential, scrutiny should still be exercised regarding its clinical use at this time.
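The accuracy figures in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the counts reported above (10 and 13 accurate responses out of 16 questions), and the variable and function names are assumptions of this example, not part of the study. It also separates the absolute gain (in percentage points) from the relative gain, since the two are easy to conflate.

```python
from fractions import Fraction

# Counts taken from the abstract: accurate responses out of 16 NASS questions.
TOTAL_QUESTIONS = 16
accurate = {"GPT-3.5": 10, "GPT-4.0": 13}


def percent(count, total):
    """Return count/total as a percentage, using exact rational arithmetic."""
    return float(Fraction(count, total) * 100)


# Per-model accuracy in percent.
rates = {model: percent(n, TOTAL_QUESTIONS) for model, n in accurate.items()}

# Absolute gain: difference between the two accuracies, in percentage points.
absolute_gain = rates["GPT-4.0"] - rates["GPT-3.5"]

# Relative gain: extra accurate answers as a share of GPT-3.5's accurate answers.
relative_gain = percent(accurate["GPT-4.0"] - accurate["GPT-3.5"], accurate["GPT-3.5"])

print(rates)          # accuracy per model, in percent
print(absolute_gain)  # percentage-point difference
print(relative_gain)  # relative improvement, in percent
```

With 16 questions, each response accounts for 6.25 percentage points, so small changes in the count move the reported accuracy considerably.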

  • Abstract
  • Citations: 3
  • 10.1182/blood-2023-185854
Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
  • Nov 2, 2023
  • Blood
  • Ivan Civettini + 14 more


  • Front Matter
  • Citations: 138
  • 10.1016/j.spinee.2013.06.030
An evidence-based clinical guideline for antibiotic prophylaxis in spine surgery
  • Aug 27, 2013
  • The spine journal : official journal of the North American Spine Society
  • William O Shaffer + 3 more


  • Abstract
  • 10.1136/jnis-2024-snis.290
E-185 Customized generative pretrained transformer for simplified patient education of carotid angioplasty and stenting: a feasibility study
  • Jul 1, 2024
  • Journal of NeuroInterventional Surgery
  • A Brake + 3 more

Maintaining patient autonomy necessitates a clear understanding of surgical procedures prior to consent. Time constraints, patient literacy, and the complexity of medical terminology pose challenges in conveying this information. Recent...

  • Research Article
  • Citations: 56
  • 10.5204/mcj.3004
ChatGPT Isn't Magic
  • Oct 2, 2023
  • M/C Journal
  • Tama Leaver + 1 more

Introduction

Author Arthur C. Clarke famously argued that in science fiction literature “any sufficiently advanced technology is indistinguishable from magic” (Clarke). On 30 November 2022, technology company OpenAI publicly released their Large Language Model (LLM)-based chatbot ChatGPT (Chat Generative Pre-Trained Transformer), and instantly it was hailed as world-changing. Initial media stories about ChatGPT highlighted the speed with which it generated new material as evidence that this tool might be both genuinely creative and actually intelligent, in both exciting and disturbing ways. Indeed, ChatGPT is part of a larger pool of Generative Artificial Intelligence (AI) tools that can very quickly generate seemingly novel outputs in a variety of media formats based on text prompts written by users. Yet, claims that AI has become sentient, or has even reached a recognisable level of general intelligence, remain in the realm of science fiction, for now at least (Leaver). That has not stopped technology companies, scientists, and others from suggesting that super-smart AI is just around the corner. Exemplifying this, the same people creating generative AI are also vocal signatories of public letters that ostensibly call for a temporary halt in AI development, but these letters are simultaneously feeding the myth that these tools are so powerful that they are the early form of imminent super-intelligent machines. For many people, the combination of AI technologies and media hype means generative AIs are basically magical insomuch as their workings seem impenetrable, and their existence could ostensibly change the world. This article explores how the hype around ChatGPT and generative AI was deployed across the first six months of 2023, and how these technologies were positioned as either utopian or dystopian, always seemingly magical, but never banal. We look at some initial responses to generative AI, ranging from schools in Australia to picket lines in Hollywood. 
We offer a critique of the utopian/dystopian binary positioning of generative AI, aligning with critics who rightly argue that focussing on these extremes displaces the more grounded and immediate challenges generative AI bring that need urgent answers. Finally, we loop back to the role of schools and educators in repositioning generative AI as something to be tested, examined, scrutinised, and played with both to ground understandings of generative AI, while also preparing today’s students for a future where these tools will be part of their work and cultural landscapes.

Hype, Schools, and Hollywood

In December 2022, one month after OpenAI launched ChatGPT, Elon Musk tweeted: “ChatGPT is scary good. We are not far from dangerously strong AI”. Musk’s post was retweeted 9400 times, liked 73 thousand times, and presumably seen by most of his 150 million Twitter followers. This type of engagement typified the early hype and language that surrounded the launch of ChatGPT, with reports that “crypto” had been replaced by generative AI as the “hot tech topic” and hopes that it would be “‘transformative’ for business” (Browne). By March 2023, global economic analysts at Goldman Sachs had released a report on the potentially transformative effects of generative AI, saying that it marked the “brink of a rapid acceleration in task automation that will drive labor cost savings and raise productivity” (Hatzius et al.). Further, they concluded that “its ability to generate content that is indistinguishable from human-created output and to break down communication barriers between humans and machines reflects a major advancement with potentially large macroeconomic effects” (Hatzius et al.). 
Speculation about the potentially transformative power and reach of generative AI technology was reinforced by warnings that it could also lead to “significant disruption” of the labour market, and the potential automation of up to 300 million jobs, with associated job losses for humans (Hatzius et al.). In addition, there was widespread buzz that ChatGPT’s “rationalization process may evidence human-like cognition” (Browne), claims that were supported by the emergent language of ChatGPT. The technology was explained as being “trained” on a “corpus” of datasets, using a “neural network” capable of producing “natural language” (Dsouza), positioning the technology as human-like, and more than ‘artificial’ intelligence. Incorrect responses or errors produced by the tech were termed “hallucinations”, akin to magical thinking, which OpenAI founder Sam Altman insisted wasn’t a word that he associated with sentience (Intelligencer staff). Indeed, Altman asserts that he rejects moves to “anthropomorphize” (Intelligencer staff) the technology; however, arguably the language, hype, and Altman’s well-publicised misgivings about ChatGPT have had the combined effect of shaping our understanding of this generative AI as alive, vast, fast-moving, and potentially lethal to humanity. Unsurprisingly, the hype around the transformative effects of ChatGPT and its ability to generate ‘human-like’ answers and sophisticated essay-style responses was matched by a concomitant panic throughout educational institutions. The beginning of the 2023 Australian school year was marked by schools and state education ministers meeting to discuss the emerging problem of ChatGPT in the education system (Hiatt). Every state in Australia, bar South Australia, banned the use of the technology in public schools, with a “national expert task force” formed to “guide” schools on how to navigate ChatGPT in the classroom (Hiatt). 
Globally, schools banned the technology amid fears that students could use it to generate convincing essay responses whose plagiarism would be undetectable with current software (Clarence-Smith). Some schools banned the technology citing concerns that it would have a “negative impact on student learning”, while others cited its “lack of reliable safeguards preventing these tools exposing students to potentially explicit and harmful content” (Cassidy). ChatGPT investor Musk famously tweeted, “It’s a new world. Goodbye homework!”, further fuelling the growing alarm about the freely available technology that could “churn out convincing essays which can't be detected by their existing anti-plagiarism software” (Clarence-Smith). Universities were reported to be moving towards more “in-person supervision and increased paper assessments” (SBS), rather than essay-style assessments, in a bid to out-manoeuvre ChatGPT’s plagiarism potential. Seven months on, concerns about the technology seem to have been dialled back, with educators more curious about the ways the technology can be integrated into the classroom to good effect (Liu et al.); however, the full implications and impacts of the generative AI are still emerging. In May 2023, the Writers Guild of America (WGA), the union representing screenwriters across the US creative industries, went on strike, and one of its core issues was “regulations on the use of artificial intelligence in writing” (Porter). Early in the negotiations, Chris Keyser, co-chair of the WGA’s negotiating committee, lamented that “no one knows exactly what AI’s going to be, but the fact that the companies won’t talk about it is the best indication we’ve had that we have a reason to fear it” (Grobar). 
At the same time, the Screen Actors Guild (SAG) warned that members were being asked to agree to contracts that stipulated that an actor’s voice could be re-used in future scenarios without that actor’s additional consent, potentially reducing actors to a dataset to be animated by generative AI technologies (Scheiber and Koblin). In a statement issued by SAG, they made their position clear that the creation or (re)animation of any digital likeness of any part of an actor must be recognised as labour and properly paid, also warning that any attempt to legislate around these rights should be strongly resisted (Screen Actors Guild). Unlike the more sensationalised hype, the WGA and SAG responses to generative AI are grounded in labour relations. These unions quite rightly fear the immediate future where human labour could be augmented, reclassified, and exploited by, and in the name of, algorithmic systems. Screenwriters, for example, might be hired at much lower pay rates to edit scripts first generated by ChatGPT, even if those editors would really be doing most of the creative work to turn something clichéd and predictable into something more appealing. Rather than a dystopian world where machines do all the work, the WGA and SAG protests railed against a world where workers would be paid less because executives could pretend generative AI was doing most of the work (Bender).

The Open Letter and Promotion of AI Panic

In an open letter that received enormous press and media uptake, many of the leading figures in AI called for a pause in AI development since “advanced AI could represent a profound change in the history of life on Earth”; they warned early 2023 had already seen “an out-of-control race to develop and deploy ever more powerful digital minds that no one – not even their creators – can understand, predict, or reliably control” (Future of Life Institute). 
Further, the open letter signatories called on “all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4”, arguing that “labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts” (Future of Life Institute). Notably, many of the signatories work for the very companies involved in the “out-of-control race”. Indeed, while this letter could be read as a moment of ethical clarity for the AI industry, a more cynical reading might just be that in warning that their AIs could effectively destroy the w

  • Research Article
  • Citations: 73
  • 10.1016/j.jclinepi.2025.111746
Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review.
  • May 1, 2025
  • Journal of clinical epidemiology
  • Judith-Lisa Lieberum + 8 more

Machine learning promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct attracted attention. We aimed to provide an overview of LLM applications in SR conduct in health research. We systematically searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review that has not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, 1 reviewer extracted data, checked by another. Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We finally included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The most frequently used LLM was Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies were predominant (n = 21, 57%). In half of the studies, authors evaluated LLM use as promising (n = 20, 54%), one-quarter as neutral (n = 9, 24%) and one-fifth as nonpromising (n = 8, 22%). Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance. Systematic reviews are a crucial tool in health research where experts carefully collect and analyze all available evidence on a specific research question. 
Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and finally found 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in this context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the increasing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.

  • Research Article
  • Citations: 1
  • 10.1101/2025.02.27.640661
SensitiveCancerGPT: Leveraging Generative Large Language Model on Structured Omics Data to Optimize Drug Sensitivity Prediction.
  • Mar 3, 2025
  • bioRxiv : the preprint server for biology
  • Shaika Chowdhury + 6 more

The fast accumulation of vast pharmacogenomics data of cancer cell lines provides unprecedented opportunities for drug sensitivity prediction (DSP), a crucial prerequisite for the advancement of precision oncology. Recently, Generative Large Language Models (LLM) have demonstrated performance and generalization prowess across diverse tasks in the field of natural language processing (NLP). However, the structured format of the pharmacogenomics data poses a challenge for the utility of LLM in DSP. Therefore, the objective of this study is multi-fold: to adapt prompt engineering for structured pharmacogenomics data toward optimizing LLM's DSP performance, to evaluate LLM's generalization in real-world DSP scenarios, and to compare LLM's DSP performance against that of state-of-the-science baselines. We systematically investigated the capability of the Generative Pre-trained Transformer (GPT) as a DSP model on four publicly available benchmark pharmacogenomics datasets, which are stratified by five cancer tissue types of cell lines and encompass both oncology and non-oncology drugs. Essentially, the predictive landscape of GPT is assessed for effectiveness on the DSP task via four learning paradigms: zero-shot learning, few-shot learning, fine-tuning and clustering pretrained embeddings. To facilitate GPT in seamlessly processing the structured pharmacogenomics data, domain-specific novel prompt engineering is employed by implementing three prompt templates (i.e., Instruction, Instruction-Prefix, Cloze) and integrating pharmacogenomics-related features into the prompt. We validated GPT's performance in diverse real-world DSP scenarios: cross-tissue generalization, blind tests, and analyses of drug-pathway associations and top sensitive/resistant cell lines. Furthermore, we conducted a comparative evaluation of GPT against multiple Transformer-based pretrained models and existing DSP baselines. 
Extensive experiments on the pharmacogenomics datasets across the five tissue cohorts demonstrate that fine-tuning GPT yields the best DSP performance (28% F1 increase, p-value = 0.0003), followed by clustering pretrained GPT embeddings (26% F1 increase, p-value = 0.0005), outperforming GPT in-context learning (i.e., few-shot). However, GPT in the zero-shot setting had a big F1 gap, resulting in the worst performance. Within the scope of prompt engineering, performance enhancement was achieved by directly instructing GPT about the DSP task and resorting to a concise context format (i.e., instruction-prefix), leading to an F1 performance gain of 22% (p-value = 0.02), while incorporation of drug-cell line prompt context derived from genomics and/or molecular features further boosted the F1 score by 2%. Compared to state-of-the-science DSP baselines, GPT achieved significantly superior mean F1 performance (16% gain, p-value < 0.05) on the GDSC dataset. In the cross-tissue analysis, GPT showed generalizability comparable to the within-tissue performances on the GDSC and PRISM datasets, and statistically significant F1 performance improvements on the CCLE (8%, p-value = 0.001) and DrugComb (19%, p-value = 0.009) datasets. Evaluation on the challenging blind tests suggests GPT's competitiveness on the CCLE and DrugComb datasets compared to random splitting. Furthermore, analyses of the drug-pathway associations and log probabilities provided valuable insights that align with previous DSP findings. The diverse experiment setups and in-depth analysis underscore the importance of generative LLM, such as GPT, as a viable in silico approach to guide precision oncology. https://github.com/bioIKEA/SensitiveCancerGPT.

  • Research Article
  • Citations: 2
  • 10.1371/journal.pdig.0000980
Development and evaluation of large-language models (LLMs) for oncology: A scoping review
  • Aug 7, 2025
  • PLOS Digital Health
  • Namya Mehan + 2 more

Large language models (LLMs), a significant development in artificial intelligence (AI), are continuing to demonstrate seminal improvement in performance for various text analysis and generation tasks. There are limited systematic studies on LLM applications that were developed/evaluated in relevance to oncology. Our scoping review explores applications of LLMs in oncology to determine (1) the nature of LLM applications relevant to a cancer/tumor type, (2) the phases of cancer care addressed by the LLMs, (3) which LLMs were used in these applications, (4) the sources and pre-processing of datasets used, (5) the techniques used to optimize the performance of LLMs, (6) the methods of evaluation, and (7) the common limitations noted by the authors of these LLM applications and to study their implications in research and practice. A librarian-assisted search was performed across the following databases: Association for Computing Machinery (ACM), Embase, Engineering Village, IEEE Xplore, Medline, Scopus, SPIE and Web of Science till Jan 12, 2024. Pre-prints from this search were considered if they were published/accepted by Feb 29, 2024. From the initial search of 14863 articles, 60 were finally included. Our results demonstrated that LLMs were mostly evaluated across a diverse set of oncology-related applications. Generative pre-trained transformer (GPT)-based LLMs were mostly used. In the subset of studies where the phase(s) of cancer care was/were provided or implied, treatment and diagnosis were the most included phases. Data for development and evaluation extended from patient health records, synthetic patient records, research and professional society publications to social media. Prompt-designing and engineering were performed as data pre-processing steps in several studies. Clinicians, trainees, researchers, and patients were among the variety of users targeted by the applications. 
In the 17% of studies that developed LLMs for oncological aspects, domain adaptation through pre-training and fine-tuning was often performed and resulted in performance improvement. The evaluation of an LLM’s performance involved the use of standard, validated, non-standardized, and/or customized performance measures considering a variety of constructs other than accuracy. Six primary themes emerged as limitations, including limitation of generalizability/applicability, sample size, bias and subjectivity, and evaluation metrics. This review highlights that LLMs specific to oncological aspects are less common than general-purpose LLMs. The application areas were heterogeneous, used diverse data sources, were directed towards a variety of users, and resulted in a variety of evaluation methods. Despite the diversity of LLM applications in oncology, future research needs to address the limited generalizability of these applications, mitigation of bias and subjectivity, and standardization of evaluation methodologies. Future applications of LLMs in oncology should include developing oncology-specific LLMs that can mitigate knowledge gaps and extend to diverse areas of oncology training and practice not considered so far.

  • Research Article
  • Citations: 11
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • Citations: 21
  • 10.1016/j.joms.2024.11.007
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
  • Mar 1, 2025
  • Journal of Oral and Maxillofacial Surgery
  • Reema Mahmoud + 5 more


  • Research Article
  • Citations: 67
  • 10.1016/j.spinee.2023.07.015
Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations
  • Jul 25, 2023
  • The spine journal : official journal of the North American Spine Society
  • Akiro H Duey + 13 more


  • Research Article
  • 10.7759/cureus.99472
Evaluating Multimodal Large Language Model (LLM) (Generative Pre-trained Transformer 5 (GPT-5)) for Meniscal Tear Detection on Knee Magnetic Resonance Imaging (MRI): A Pilot Study
  • Dec 17, 2025
  • Cureus
  • Kwan Kit Chan + 2 more

Introduction: Magnetic resonance imaging (MRI) of the knee is the gold standard for evaluating meniscal injuries. While specialized artificial intelligence (AI) models have demonstrated high diagnostic capability in detecting meniscal tears, the performance of general-purpose large language models (LLMs) with multimodal vision capabilities remains underexplored. Previous iterations, such as generative pre-trained transformer 4 (GPT-4) (OpenAI, San Francisco, CA, USA) with vision, have shown limited success in direct musculoskeletal image interpretation. This study evaluates the diagnostic performance of the latest-generation LLM, generative pre-trained transformer 5 (GPT-5), in detecting meniscal tears on knee MRI.
Objectives: This study aimed to evaluate the diagnostic performance of GPT-5 (a general-purpose multimodal LLM) in detecting meniscal tears on knee MRI in a zero-shot setting, using a publicly available dataset.
Materials and methods: One hundred knee MRI examinations (50 with meniscal tears, 50 without) were randomly selected from the MRNet validation dataset, with ground-truth labels extracted from the dataset. Sagittal T2-weighted and coronal T1-weighted series were reviewed for completeness and image quality and then converted to Portable Network Graphics (PNG) slices. GPT-5 (gpt-5-2025-08-07) analyzed each case in zero-shot fashion using a fixed prompt requesting a binary ("yes/no") determination of meniscal tear presence without any clinical context. Model predictions were compared with ground truth, and accuracy, precision, recall, specificity, and F1-scores were calculated with 95% confidence intervals.
Results: GPT-5 achieved an overall accuracy of 76% (95% CI: 0.668-0.833). The model demonstrated a sensitivity (recall) of 84% (95% CI: 0.715-0.917) and a specificity of 68% (95% CI: 0.542-0.792). The precision for detecting tears was 72.4%, and the F1-score was 0.778.
Conclusion: In this pilot study, GPT-5 demonstrates potential in the zero-shot interpretation of knee MRIs for meniscal tear detection, outperforming previous multimodal LLMs. However, the results should be interpreted with caution due to study limitations, and clinical utility is currently limited by a high false-positive rate and lack of visual explainability. Nevertheless, this pilot evaluation provides an initial proof of concept, and with larger datasets, rigorous validation, improved calibration, and enhanced explainability, future multimodal LLMs may evolve into supportive, human-in-the-loop tools in musculoskeletal radiology.
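The reported metrics are enough to reconstruct the underlying confusion matrix, which makes the figures easy to cross-check. The sketch below is an illustration under stated assumptions: the counts are back-calculated from the reported sensitivity and specificity on 50 tear and 50 no-tear examinations, and the abstract does not say which confidence-interval method was used. The Wilson score interval is used here as an assumption; it happens to reproduce the reported bounds to three decimals.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Study design from the abstract: 50 examinations with tears, 50 without.
N_TEAR, N_NO_TEAR = 50, 50

# Back-calculate counts from the reported sensitivity (84%) and specificity (68%).
tp = round(0.84 * N_TEAR)      # true positives
tn = round(0.68 * N_NO_TEAR)   # true negatives
fn = N_TEAR - tp               # missed tears
fp = N_NO_TEAR - tn            # false alarms

accuracy = (tp + tn) / (N_TEAR + N_NO_TEAR)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn, tn)
print(round(accuracy, 3), round(precision, 3), round(f1, 3))
for label, k, n in [("accuracy", tp + tn, 100),
                    ("sensitivity", tp, N_TEAR),
                    ("specificity", tn, N_NO_TEAR)]:
    lo, hi = wilson_interval(k, n)
    print(label, round(lo, 3), round(hi, 3))
```

The 16 false positives out of 50 tear-free examinations are what the conclusion refers to as a high false-positive rate.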

  • Research Article
  • Citations: 102
  • 10.1016/j.spinee.2008.05.008
Antibiotic prophylaxis in spine surgery: an evidence-based clinical guideline for the use of prophylactic antibiotics in spine surgery
  • Jul 10, 2008
  • The Spine Journal
  • William C Watters + 6 more


  • Research Article
  • Citations: 1
  • 10.1093/eurheartj/ehae666.3491
A guideline-informed language model for paediatric cardiology demonstrates high performance in answering complex medical questions
  • Oct 28, 2024
  • European Heart Journal
  • T Uden + 13 more

Background: Paediatric cardiology presents unique challenges with its diverse and complex cases, limited evidence base, and the necessity for multi-expert involvement in decision-making processes. In this context, generative pre-trained transformer (GPT) based large language models (LLMs) offer a potential avenue for providing complex information and clinical decision support.
Purpose: This study evaluates the quality of three different GPT LLMs in answering complex medical questions, including a state-of-the-art preview model that incorporates the German paediatric cardiology guidelines.
Methods: Seven paediatric cardiologists and paediatric cardiac surgeons generated 72 questions, including complex questions and medical cases with associated questions. The questions were categorized by difficulty and required knowledge (factual and experience-based, or mostly experience-based). We posed the questions to three LLMs: GPT 3.5, GPT 4 and a GPT 4 turbo preview. The GPT 4 turbo preview was customized by incorporating all guidelines from the German Society for Paediatric Cardiology via a retrieval function. Employing one complex instruction for all questions, we prompted the LLMs to provide precise and detailed expert-level responses. Experts rated the responses from each model on relevance, factual accuracy, severity of possible harm, completeness, superfluous content, and age-related appropriateness on a scale from 0 (very bad) to 7 (very good). Differences were tested using the Kruskal-Wallis test in SPSS Version 28.
Results: Our findings indicated a good performance of all models regarding the dimensions tested. The figures show the average ratings (Figure 1, Figure 2A) and highlight significant differences after Bonferroni correction in bold (Figure 2B). The GPT 4 turbo preview, including the retrieval of guidelines, provided significantly more relevant (average rating [AR] 5.94, meaning mostly relevant), accurate (AR 5.6, meaning between somewhat and mostly accurate) and complete (AR 5, meaning fairly complete) answers compared to GPT 3.5 and GPT 4. Ratings did not differ significantly by difficulty level or question type. Relevance ratings were slightly better for factual questions (AR 5.7) than for those requiring more experience-based knowledge (AR 5.3). Although GPT 4 had higher average scores than GPT 3.5 in all dimensions except superfluous content, the differences in rating were not statistically significant. All models had notable difficulty accounting for the age-related aspects of the questions (AR 4.06-4.45, p=0.455).
Conclusion: This study highlights the potential and limitations of AI language models in addressing complex medical questions in fields characterized by highly individualized decision-making scenarios. The findings advocate for the development of more specialized AI tools in medicine, tailored to specific medical fields and patient age groups.
Fig 1: Average ratings of LLMs
Fig 2: Rating differences between LLMs
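
For readers unfamiliar with the statistics named in this abstract, the following is a minimal Python sketch of a Kruskal-Wallis test across three groups of ordinal ratings, followed by Bonferroni-adjusted pairwise comparisons. It is not the authors' SPSS analysis: the ratings are invented for illustration, the function names are hypothetical, and the pairwise step simply reruns Kruskal-Wallis on each pair, which is a simplification and not necessarily the post hoc procedure used in the study. The p-values rely on the closed-form chi-square tail for 1 and 2 degrees of freedom, so the sketch only covers comparisons of two or three groups.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return the tie-corrected H statistic and its degrees of freedom."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction, len(groups) - 1


def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df 1 and 2 only."""
    x = max(x, 0.0)  # guard against tiny negative values from rounding
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def bonferroni_pairwise(named_groups):
    """Pairwise Kruskal-Wallis p-values, multiplied by the number of pairs."""
    pairs = list(combinations(named_groups, 2))
    adjusted = {}
    for a, b in pairs:
        h, df = kruskal_wallis([named_groups[a], named_groups[b]])
        adjusted[(a, b)] = min(1.0, chi2_sf(h, df) * len(pairs))
    return adjusted


# Hypothetical 0-7 expert ratings, invented for illustration only.
ratings = {
    "GPT 3.5": [4, 5, 5, 6, 4, 5],
    "GPT 4": [5, 5, 6, 6, 5, 4],
    "GPT 4 turbo + guidelines": [6, 7, 6, 5, 7, 6],
}
h, df = kruskal_wallis(list(ratings.values()))
print(f"H = {h:.2f}, df = {df}, p = {chi2_sf(h, df):.4f}")
for pair, p in bonferroni_pairwise(ratings).items():
    print(pair, f"adjusted p = {p:.4f}")
```

A rank-based test is a natural fit here because the 0 to 7 ratings are ordinal, and the Bonferroni step guards against false positives when three model pairs are compared at once.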

  • Research Article
  • Citations: 1
  • 10.1016/j.jclinepi.2026.112221
Large language models show promising performance for some systematic review tasks but call for cautious implementation: a systematic review.
  • Jun 1, 2026
  • Journal of clinical epidemiology
  • Florian Laignelot + 9 more

With the exponential growth of biomedical literature, the challenge of conducting systematic reviews is becoming increasingly burdensome. We aimed to evaluate the performance of large language models (LLMs) in the automation of some or all steps of systematic reviews and meta-analyses. In this systematic review, we searched PubMed, Embase, the Cochrane Library and preprint platforms up to January 14, 2025. We included any studies assessing the performance of LLMs (eg, generative pre-trained transformer [GPT], Claude, Mistral) in any step of the systematic review process. Pairs of reviewers independently extracted data and assessed risk of bias. We summarized positive and negative percent agreement (PPA and NPA, analogous to sensitivity and specificity, respectively) between LLMs and human reviewers using the median and interquartile range (IQR). From 3889 unique references, we included 63 studies, of which 52 reported performance metrics, for a total of 148 LLM performance assessments. Most assessments concerned GPT models (n = 114, 77%). The most frequently evaluated tasks were title and abstract screening (n = 78, 53%), data extraction (n = 23, 16%), and full-text screening (n = 20, 14%). For title and abstract screening, overall median PPA was 0.92 (IQR 0.69-0.98) and median NPA was 0.89 (0.72-0.95). For full-text screening, the overall median PPA was 0.93 (0.87-1.00) and median NPA was 0.92 (0.78-0.97). Late-generation LLMs released after GPT-4 seemed to provide higher performance than earlier models. For other tasks, authors reported overall good performance, but variability in performance metrics precluded complete quantitative synthesis. Global accuracy for data extraction tasks ranged from 0.36 to 1.00, with a median accuracy of 0.95 (IQR 0.91-0.97, n = 11). For the "risk of bias assessment" task, accuracy ranged from 0.44 to 0.90 (median = 0.62, IQR 0.53-0.76, n = 6).
The performance of LLMs, particularly newer generations, shows promise in automating some repetitive steps of systematic reviews such as screening. However, their successful integration will require appropriate safeguards and careful implementation.
Systematic reviews are one of the most reliable ways to answer medical and public health questions. They bring together all available studies on a topic and help clinicians and policymakers make informed decisions. However, producing a high-quality systematic review takes a lot of time and effort. Whole teams of researchers spend months screening thousands of articles, extracting data, and double-checking results. With a little more than a million new publications every year, keeping reviews up to date is becoming increasingly difficult. LLMs, such as ChatGPT, may help reduce this workload. These tools can read and summarize text and might assist with repetitive tasks like selecting relevant studies or extracting information from articles. But it is still unclear how reliable these tools are for research purposes. This is the first systematic review to assess how well LLMs perform in facilitating systematic reviews. We sought to review all studies that tested LLMs in the different steps of systematic reviews and found 63 studies evaluating how well these tools performed compared with human reviewers. Overall, LLMs showed good agreement with humans for tasks such as screening titles and abstracts, and full-text articles. Newer models seemed to perform better than older ones. However, performance was more variable for complex tasks that require interpretation, such as extracting detailed data or assessing methodological quality. Our findings suggest that LLMs could help researchers work faster and make systematic reviews more efficient. However, they are not ready to replace human judgment. These tools can make mistakes, produce inconsistent results, or generate inaccurate information if not carefully supervised.
In practice, LLMs should be used as assistants rather than substitutes. With proper safeguards, transparent reporting, and human oversight, they may become valuable tools to support evidence-based healthcare and help keep research up to date.
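
The agreement measures reported in this abstract can be illustrated with a short Python sketch. It assumes paired include/exclude decisions, takes the human reviewer as the comparator, and pools per-assessment values with a median and interquartile range. The function names, the example decisions, and the "inclusive" quartile method are assumptions made for illustration; the abstract does not state how the review's authors computed quartiles.

```python
from statistics import median, quantiles


def percent_agreement(llm, human):
    """Return (PPA, NPA) for paired include/exclude decisions.

    PPA: share of human "include" decisions that the LLM also included.
    NPA: share of human "exclude" decisions that the LLM also excluded.
    """
    if len(llm) != len(human):
        raise ValueError("decision lists must have the same length")
    human_pos = sum(1 for h in human if h)
    human_neg = len(human) - human_pos
    if human_pos == 0 or human_neg == 0:
        raise ValueError("need at least one include and one exclude decision")
    both_pos = sum(1 for l, h in zip(llm, human) if l and h)
    both_neg = sum(1 for l, h in zip(llm, human) if not l and not h)
    return both_pos / human_pos, both_neg / human_neg


def median_iqr(values):
    """Return (median, Q1, Q3); the quartile method is an assumption."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# Hypothetical screening decisions (True = include), invented for illustration.
human = [True, True, True, True, False, False, False, False, False, False]
llm = [True, True, True, False, False, False, False, False, False, True]
ppa, npa = percent_agreement(llm, human)
print(f"PPA = {ppa:.2f}, NPA = {npa:.2f}")

# Hypothetical PPA values from five separate assessments.
mid, q1, q3 = median_iqr([0.5, 0.6, 0.7, 0.8, 0.9])
print(f"median PPA = {mid:.2f} (IQR {q1:.2f}-{q3:.2f})")
```

Reporting PPA and NPA separately matters in screening because a tool can agree with humans on most exclusions while still missing a meaningful share of the studies that should be included.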
