Medical large language models and systems in the clinical application of spinal diseases: Current status, challenges, and future prospects.

Similar Papers
  • Research Article
  • Cite count: 1
  • 10.1101/2025.02.27.640661
SensitiveCancerGPT: Leveraging Generative Large Language Model on Structured Omics Data to Optimize Drug Sensitivity Prediction.
  • Mar 3, 2025
  • bioRxiv: the preprint server for biology
  • Shaika Chowdhury + 6 more

The rapid accumulation of pharmacogenomics data from cancer cell lines provides unprecedented opportunities for drug sensitivity prediction (DSP), a crucial prerequisite for the advancement of precision oncology. Recently, generative large language models (LLMs) have demonstrated strong performance and generalization across diverse tasks in natural language processing (NLP). However, the structured format of pharmacogenomics data poses a challenge for the use of LLMs in DSP. Therefore, the objective of this study is multi-fold: to adapt prompt engineering for structured pharmacogenomics data toward optimizing LLMs' DSP performance, to evaluate LLMs' generalization in real-world DSP scenarios, and to compare LLMs' DSP performance against that of state-of-the-science baselines. We systematically investigated the capability of the Generative Pre-trained Transformer (GPT) as a DSP model on four publicly available benchmark pharmacogenomics datasets, which are stratified by five cancer tissue types of cell lines and encompass both oncology and non-oncology drugs. Specifically, GPT's effectiveness on the DSP task is assessed via four learning paradigms: zero-shot learning, few-shot learning, fine-tuning, and clustering of pretrained embeddings. To help GPT process the structured pharmacogenomics data, domain-specific prompt engineering is employed by implementing three prompt templates (i.e., Instruction, Instruction-Prefix, Cloze) and integrating pharmacogenomics-related features into the prompt. We validated GPT's performance in diverse real-world DSP scenarios: cross-tissue generalization, blind tests, and analyses of drug-pathway associations and top sensitive/resistant cell lines. Furthermore, we conducted a comparative evaluation of GPT against multiple Transformer-based pretrained models and existing DSP baselines. 
Extensive experiments on the pharmacogenomics datasets across the five tissue cohorts demonstrate that fine-tuning GPT yields the best DSP performance (28% F1 increase, p-value = 0.0003), followed by clustering pretrained GPT embeddings (26% F1 increase, p-value = 0.0005), both outperforming GPT in-context learning (i.e., few-shot). In contrast, GPT in the zero-shot setting showed a large F1 gap and the worst performance. Within the scope of prompt engineering, performance improved when GPT was directly instructed about the DSP task and given a concise context format (i.e., instruction-prefix), leading to an F1 gain of 22% (p-value = 0.02), while incorporating drug-cell line prompt context derived from genomics and/or molecular features further boosted the F1 score by 2%. Compared to state-of-the-science DSP baselines, GPT achieved significantly higher mean F1 performance (16% gain, p-value < 0.05) on the GDSC dataset. In the cross-tissue analysis, GPT showed generalizability comparable to its within-tissue performance on the GDSC and PRISM datasets, and statistically significant F1 improvements on the CCLE (8%, p-value = 0.001) and DrugComb (19%, p-value = 0.009) datasets. Evaluation on the challenging blind tests suggests GPT's competitiveness on the CCLE and DrugComb datasets compared to random splitting. Furthermore, analyses of the drug-pathway associations and log probabilities provided valuable insights that align with previous DSP findings. The diverse experiment setups and in-depth analysis underscore the importance of generative LLMs, such as GPT, as a viable in silico approach to guide precision oncology. https://github.com/bioIKEA/SensitiveCancerGPT.
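
As a rough illustration of the three prompt templates named in this abstract (Instruction, Instruction-Prefix, Cloze), the sketch below serializes one structured drug-cell line record into text. Only the template names come from the abstract. The record fields, the wording of each template, and the `build_prompt` helper are hypothetical and are not the authors' prompts, which would be found in their repository.

```python
from dataclasses import dataclass


@dataclass
class DrugCellRecord:
    """One structured pharmacogenomics row (hypothetical fields)."""
    drug: str
    cell_line: str
    tissue: str


# Hypothetical wordings; only the three template names come from the abstract.
TEMPLATES = {
    "instruction": (
        "Decide whether the cell line is sensitive or resistant to the drug.\n"
        "Drug: {drug}\nCell line: {cell_line}\nTissue: {tissue}\nAnswer:"
    ),
    "instruction_prefix": (
        "Task: drug sensitivity prediction. "
        "drug={drug}; cell_line={cell_line}; tissue={tissue} ->"
    ),
    "cloze": "The {tissue} cell line {cell_line} is [MASK] to the drug {drug}.",
}


def build_prompt(record: DrugCellRecord, template: str) -> str:
    """Render a structured record as text using the chosen template."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template!r}")
    return TEMPLATES[template].format(
        drug=record.drug, cell_line=record.cell_line, tissue=record.tissue
    )


if __name__ == "__main__":
    # Placeholder values, not real drugs or cell lines.
    example = DrugCellRecord(drug="DrugA", cell_line="CellX", tissue="lung")
    for name in TEMPLATES:
        print(f"--- {name} ---")
        print(build_prompt(example, name))
```

Keeping the templates in one dictionary makes it easy to compare formats on the same record, which is the kind of comparison the abstract reports.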

  • Discussion
  • 10.14245/ns.2448236.118
Commentary on “Performance of a Large Language Model in the Generation of Clinical Guidelines for Antibiotic Prophylaxis in Spine Surgery”
  • Mar 1, 2024
  • Neurospine
  • Sun-Ho Lee

The introduction of artificial intelligence (AI), particularly large language models (LLMs) such as the generative pre-trained transformer (GPT) series, into the medical field has heralded a new era of data-driven medicine. AI's capacity for processing vast datasets has enabled the development of predictive models that can forecast patient outcomes with remarkable accuracy. LLMs like GPT and its successors have demonstrated an ability to understand and generate human-like text, facilitating their application in medical documentation, patient interaction, and even in generating diagnostic reports from patient data and imaging findings. Over the past 10 years, the development of AI, LLMs, and GPTs has significantly impacted the field of neurosurgery and spinal care as well [1-5]. Zaidat et al. [6] studied the performance of an LLM in the generation of clinical guidelines for antibiotic prophylaxis in spine surgery. This study delves into the capabilities of ChatGPT's models, GPT-3.5 and GPT-4.0, showcasing their potential to streamline medical processes. They suggest that GPT-3.5's ability to generate clinically relevant antibiotic use guidelines for spinal surgery is commendable; however, its limitations, such as the inability to discern the most crucial aspects of the guidelines, redundancy, fabrication of citations, and inconsistency, pose significant barriers to its practical application. GPT-4.0, on the other hand, demonstrates a marked improvement in response accuracy and the ability to cite authoritative guidelines, such as those from the North American Spine Society (NASS). This model's enhanced performance, including a 20% increase in response accuracy and the ability to cite the NASS guideline in over 60% of responses, suggests a more reliable tool for clinicians seeking to integrate AI-generated content into their practice. 
However, the study's findings also highlight the inherent unpredictability of LLM responses and the potential for "artificial hallucination," where models generate spurious statements without a solid basis in their training data. This phenomenon raises concerns about the ethical implications of using LLMs in clinical settings, particularly regarding patient care and liability. The possibility of LLMs providing inaccurate responses, especially when prompted for medical advice, necessitates a cautious approach to their deployment. We also note the limitations of the study itself, including the outdated nature of the NASS guidelines, which have not been updated since 2013, and the potential biases and gaps in the medical knowledge contained within the LLMs' training data. These factors highlight the im...

  • Research Article
  • Cite count: 70
  • 10.1016/j.jclinepi.2025.111746
Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review.
  • May 1, 2025
  • Journal of clinical epidemiology
  • Judith-Lisa Lieberum + 8 more

Machine learning promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct have attracted attention. We aimed to provide an overview of LLM applications in SR conduct in health research. We systematically searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review that had not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, 1 reviewer extracted data, checked by another. Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We finally included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The most frequently used LLM was Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies were predominant (n = 21, 57%). In half of the studies, authors evaluated LLM use as promising (n = 20, 54%), one-quarter as neutral (n = 9, 24%) and one-fifth as nonpromising (n = 8, 22%). Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance. Systematic reviews are a crucial tool in health research where experts carefully collect and analyze all available evidence on a specific research question. 
Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and finally found 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in this context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the increasing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.

  • Abstract
  • 10.1136/jnis-2024-snis.290
E-185 Customized generative pretrained transformer for simplified patient education of carotid angioplasty and stenting: a feasibility study
  • Jul 1, 2024
  • Journal of NeuroInterventional Surgery
  • A Brake + 3 more

Maintaining patient autonomy necessitates a clear understanding of surgical procedures prior to consent. Time constraints, patient literacy, and the complexity of medical terminology pose challenges in conveying this information. Recent...

  • Research Article
  • Cite count: 1
  • 10.1093/eurheartj/ehae666.3491
A guideline-informed language model for paediatric cardiology demonstrates high performance in answering complex medical questions
  • Oct 28, 2024
  • European Heart Journal
  • T Uden + 13 more

Background: Paediatric cardiology presents unique challenges with its diverse and complex cases, limited evidence base, and the necessity for multi-expert involvement in decision-making processes. In this context, the introduction of generative pre-trained transformer (GPT) based large language models (LLMs) offers a potential avenue for the provision of complex information and clinical decision support. Purpose: This study evaluates the quality of three different GPT LLMs in answering complex medical questions, including a state-of-the-art preview model that incorporates the German paediatric cardiology guidelines. Methods: Seven paediatric cardiologists and paediatric cardiac surgeons generated 72 questions, including complex questions and medical cases with associated questions. The questions were categorized by difficulty and required knowledge (factual and experience-based or mostly experience-based). We posed the questions to three LLMs: GPT 3.5, GPT 4 and a GPT 4 turbo preview. The GPT 4 turbo preview was customized by incorporating all guidelines from the German Society for Paediatric Cardiology via a retrieval function. Employing one complex instruction for all questions, we prompted the LLMs to provide precise and detailed expert-level responses. The responses from each model were evaluated by experts based on relevance, factual accuracy, severity of possible harm, completeness, superfluous content, and age-related appropriateness from 0 (very bad) to 7 (very good). Differences were calculated using the Kruskal-Wallis test in SPSS Version 28. Results: Our findings indicated a good performance of all models regarding the dimensions tested. The figures show the average ratings (Figure 1, Figure 2A) and highlight significant differences after Bonferroni correction in bold (Figure 2B). 
The GPT 4 turbo preview, including the retrieval of guidelines, provided significantly more relevant (average rating [AR] 5.94, meaning mostly relevant), accurate (AR 5.6, meaning between somewhat and mostly accurate) and complete (AR 5, meaning fairly complete) answers compared to GPT 3.5 and GPT 4. In terms of difficulty levels or the type of questions, there was no significant difference in rating. Relevance ratings were slightly better in factual questions (AR 5.7) than in those requiring more experience-based knowledge (AR 5.3). Although GPT 4 had higher average scores compared to GPT 3.5 in all dimensions except superfluous content, the differences in rating were not statistically significant. All models had considerable difficulty accounting for the age-related aspects of the questions (AR 4.06-4.45, p=0.455). Conclusion: This study highlights the potential and limitations of AI language models in addressing complex medical questions in fields characterized by highly individualized decision-making scenarios. The findings advocate for the development of more specialized AI tools in medicine, tailored to specific medical fields and patient age groups. Fig 1: Average ratings of LLMs. Fig 2: Rating differences between LLMs.

  • Abstract
  • Cite count: 3
  • 10.1182/blood-2023-185854
Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
  • Nov 2, 2023
  • Blood
  • Ivan Civettini + 14 more

  • Research Article
  • Cite count: 37
  • 10.1287/msom.2023.0279
A Manager and an AI Walk into a Bar: Does ChatGPT Make Biased Decisions Like We Do?
  • Jan 31, 2025
  • Manufacturing & Service Operations Management
  • Yang Chen + 4 more

Problem definition: Large language models (LLMs) are being increasingly leveraged in business and consumer decision-making processes. Because LLMs learn from human data and feedback, which can be biased, determining whether LLMs exhibit human-like behavioral decision biases (e.g., base-rate neglect, risk aversion, confirmation bias, etc.) is crucial prior to implementing LLMs into decision-making contexts and workflows. To understand this, we examine 18 common human biases that are important in operations management (OM) using the dominant LLM, ChatGPT. Methodology/results: We perform experiments where GPT-3.5 and GPT-4 act as participants to test these biases using vignettes adapted from the literature (“standard context”) and variants reframed in inventory and general OM contexts. In almost half of the experiments, Generative Pre-trained Transformer (GPT) mirrors human biases, diverging from prototypical human responses in the remaining experiments. We also observe that GPT models have a notable level of consistency between the standard and OM-specific experiments as well as across temporal versions of the GPT-3.5 model. Our comparative analysis between GPT-3.5 and GPT-4 reveals a dual-edged progression of GPT’s decision making, wherein GPT-4 advances in decision-making accuracy for problems with well-defined mathematical solutions while simultaneously displaying increased behavioral biases for preference-based problems. Managerial implications: First, our results highlight that managers will obtain the greatest benefits from deploying GPT to workflows leveraging established formulas. Second, that GPT displayed a high level of response consistency across the standard, inventory, and non-inventory operational contexts provides optimism that LLMs can offer reliable support even when details of the decision and problem contexts change. 
Third, although selecting between models, like GPT-3.5 and GPT-4, represents a trade-off in cost and performance, our results suggest that managers should invest in higher-performing models, particularly for solving problems with objective solutions. Funding: This work was supported by the Social Sciences and Humanities Research Council of Canada [Grant SSHRC 430-2019-00505]. The authors also gratefully acknowledge the Smith School of Business at Queen’s University for providing funding to support Y. Chen’s postdoctoral appointment. Supplemental Material: The online appendix is available at https://doi.org/10.1287/msom.2023.0279 .

  • Research Article
  • Cite count: 21
  • 10.1016/j.joms.2024.11.007
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
  • Mar 1, 2025
  • Journal of Oral and Maxillofacial Surgery
  • Reema Mahmoud + 5 more

  • Research Article
  • 10.1164/ajrccm.2025.211.abstracts.a2086
Leveraging Generative Pre-Trained Transformer (GPT) Large Language Models (LLMs) for Interstitial Lung Diseases (ILD) Clinical Research
  • May 1, 2025
  • American Journal of Respiratory and Critical Care Medicine
  • S Chen + 4 more

Rationale: The majority of clinically relevant data is contained in unstructured text such as clinical notes. ILD notes are particularly prone to verbosity and imprecision, making structured data extraction a major bottleneck for clinical research and a costly endeavor when maintaining ILD registries and databases. In this study, we explore the utility and performance characteristics of current GPT LLMs for extracting structured binary data from unstructured clinical text. Methods: We ran the GPT-3.5-turbo, GPT-4, GPT-4o, GPT-4o-mini and GPT-o1-mini models through a protected health information (PHI) compliant API pipeline, and ran Llama 3.3-70B, Llama 3.1-8B and Mistral-7b-0.3 locally on GPUs. Patients who were initially seen in the Stanford ILD clinic between 2018 and 2022 and had at least three follow-up visits were selected for this study. Prompt engineering was performed using notes randomly selected from 10 patients of this cohort. Various prompt types (simple, heuristic, and chain of thought) were tested and those with the best performance were selected. 100 randomly selected clinic notes from the patient cohort that had not been previously used in the prompt engineering stage were then processed through the LLM pipeline. Three ILD physicians independently scored the same notes and prompts. The discordant answers were reviewed by two ILD physicians to reach a consensus answer, which was considered the ground truth for this study. In addition to answering questions, LLMs can also provide a measure of their confidence in the answer. We used the logprobs parameter to assess the LLM's confidence in its answer, and compared the accuracy for answers in which the LLM was confident versus not. Results: It took approximately 1-2 seconds to process each clinical note-prompt combination. The three ILD physicians' mean accuracy was 96%, similar to that of GPT-4, GPT-4o, GPT-4o-mini and GPT-o1-mini (Table 1). 
The accuracy of GPT-3.5, Llama 3.3-70B, Llama 3.1-8B and Mistral-7b-0.3 was much lower, to the point of not being clinically useful. The LLMs were substantially less accurate for questions where the LLM's level of confidence was low (<80%). Conclusions: GPT LLMs demonstrate human-level accuracy while being orders of magnitude faster for extracting structured binary data from unstructured ILD clinical data. Using the LLM's confidence in the answer can further identify records requiring human review, allowing for additional improvement in accuracy. Incorporating these GPT LLMs in ILD clinical research has the potential to dramatically accelerate data-driven insights, streamline research workflows, and elevate clinical research to a new level.
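
The confidence-based triage described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: it takes the log probability of the answer token to be a natural logarithm, converts it to a probability with `exp`, and flags answers below the 80% level mentioned in the abstract. The function names and the sample records are hypothetical, and the abstract does not say how the authors derived confidence from the logprobs output.

```python
import math

# The abstract reports lower accuracy when confidence is below 80%.
CONFIDENCE_THRESHOLD = 0.80


def logprob_to_probability(logprob: float) -> float:
    """Convert a natural-log probability to a probability in [0, 1]."""
    return math.exp(logprob)


def needs_human_review(answer_logprob: float,
                       threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Flag an answer whose token probability falls below the threshold."""
    return logprob_to_probability(answer_logprob) < threshold


def triage(answers):
    """Split (note_id, answer, logprob) tuples into confident and review lists."""
    confident, review = [], []
    for note_id, answer, logprob in answers:
        bucket = review if needs_human_review(logprob) else confident
        bucket.append((note_id, answer))
    return confident, review


if __name__ == "__main__":
    # Hypothetical outputs: (note id, binary answer, logprob of the answer token).
    sample = [("note-1", "yes", -0.05), ("note-2", "no", -0.50)]
    confident, review = triage(sample)
    print("confident:", confident)
    print("needs review:", review)
```

Routing only the low-confidence records to a physician is one way to read the abstract's suggestion that confidence can identify records requiring human review.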

  • Research Article
  • Cite count: 4
  • 10.2196/67914
Evaluation of Large Language Models in Tailoring Educational Content for Cancer Survivors and Their Caregivers: Quality Analysis
  • Apr 7, 2025
  • JMIR Cancer
  • Darren Liu + 10 more

Background: Cancer survivors and their caregivers, particularly those from disadvantaged backgrounds with limited health literacy or racial and ethnic minorities facing language barriers, are at a disproportionately higher risk of experiencing symptom burdens from cancer and its treatments. Large language models (LLMs) offer a promising avenue for generating concise, linguistically appropriate, and accessible educational materials tailored to these populations. However, there is limited research evaluating how effectively LLMs perform in creating targeted content for individuals with diverse literacy and language needs. Objective: This study aimed to evaluate the overall performance of LLMs in generating tailored educational content for cancer survivors and their caregivers with limited health literacy or language barriers, compare the performances of 3 Generative Pretrained Transformer (GPT) models (ie, GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo; OpenAI), and examine how different prompting approaches influence the quality of the generated content. Methods: We selected 30 topics from national guidelines on cancer care and education. GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo were used to generate tailored content of up to 250 words at a 6th-grade reading level, with translations into Spanish and Chinese for each topic. Two distinct prompting approaches (textual and bulleted) were applied and evaluated. Nine oncology experts evaluated 360 generated responses based on predetermined criteria: word limit, reading level, and quality assessment (ie, clarity, accuracy, relevance, completeness, and comprehensibility). ANOVA (analysis of variance) or chi-square analyses were used to compare differences among the various GPT models and prompts. Results: Overall, LLMs showed excellent performance in tailoring educational content, with 74.2% (267/360) adhering to the specified word limit and achieving an average quality assessment score of 8.933 out of 10. 
However, LLMs showed moderate performance in reading level, with 41.1% (148/360) of content failing to meet the sixth-grade reading level. LLMs demonstrated strong translation capabilities, achieving an accuracy of 96.7% (87/90) for Spanish and 81.1% (73/90) for Chinese translations. Common errors included imprecise scopes, inaccuracies in definitions, and content that lacked actionable recommendations. The more advanced GPT-4 family models showed better overall performance compared to GPT-3.5 Turbo. Prompting GPTs to produce bulleted-format content was likely to result in better educational content compared with textual-format content. Conclusions: All 3 LLMs demonstrated high potential for delivering multilingual, concise, and low health literacy educational content for cancer survivors and caregivers who face limited literacy or language barriers. GPT-4 family models were notably more robust. While further refinement is required to ensure simpler reading levels and fully comprehensive information, these findings highlight LLMs as an emerging tool for bridging gaps in cancer education and advancing health equity. Future research should integrate expert feedback, additional prompt engineering strategies, and specialized training data to optimize content accuracy and accessibility.

  • Research Article
  • 10.1177/13621688261415584
Can ChatGPT score ESL writing? A correlation analysis between teacher and GenAI scores
  • Feb 26, 2026
  • Language Teaching Research
  • Anas Alkhofi

Large language models (LLMs) have recently gained attention in automated writing evaluation (AWE) due to their flexibility, ease of use, and free accessibility. However, most existing studies have relied on standardized rubrics and detailed scoring guidelines to guide model outputs. Recent evidence suggests that LLMs can adapt their scoring behavior through example-based calibration. Building on this insight, the present study examines whether ChatGPT-4o can mirror individual instructors' evaluative tendencies. Data consisted of 100 previously graded final exam writing samples from Saudi students of English as a second language (ESL), provided by five instructors at a Saudi university's Bachelor of Arts program. GPT (generative pre-trained transformer) was calibrated using instructor-graded writing samples to enhance its alignment with human grading criteria. Subsequent analysis involved 82 samples, excluding those used in calibration. Results revealed a strong positive and statistically significant correlation (r = 0.816, p < .001) between GPT scores and teacher-assigned scores. Descriptive analyses further indicated differential scoring tendencies: GPT was more generous toward lower-quality writings, assigning higher mean scores than human raters, whereas teachers tended to award higher scores than GPT for high-quality writings. These findings suggest that GPT, particularly when effectively calibrated, can mirror teacher grading practices, though with notable differences at performance extremes. Consequently, this study highlights GPT's potential as a complementary assessment tool in ESL writing instruction.
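
For readers unfamiliar with the statistic, the sketch below computes a correlation coefficient between two sets of scores. It assumes the reported r is a Pearson coefficient, which the abstract does not state explicitly. The `pearson_r` helper and the score lists are illustrative only and are not the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up scores for illustration; not taken from the study.
    teacher_scores = [12, 15, 9, 18, 14]
    gpt_scores = [13, 15, 11, 17, 14]
    print(f"r = {pearson_r(teacher_scores, gpt_scores):.3f}")
```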

  • Research Article
  • 10.1016/j.jclinepi.2026.112221
Large language models show promising performance for some systematic review tasks but call for cautious implementation: a systematic review.
  • Mar 12, 2026
  • Journal of clinical epidemiology
  • Florian Laignelot + 9 more

With the exponential growth of biomedical literature, the challenge of conducting systematic reviews is becoming increasingly burdensome. We aimed to evaluate the performance of large language models (LLMs) in the automation of some or all steps of systematic reviews and meta-analyses. In this systematic review, we searched PubMed, Embase, the Cochrane Library and preprint platforms up to January 14, 2025. We included any studies assessing the performance of LLMs (eg, generative pre-trained transformer [GPT], Claude, Mistral) in any step of the systematic review process. Pairs of reviewers independently extracted data and assessed risk of bias. We summarized positive percent agreement (PPA) and negative percent agreement (NPA) between LLMs and human reviewers, analogous to sensitivity and specificity respectively, using medians and interquartile ranges (IQRs). From 3889 unique references, we included 63 studies, of which 52 reported performance metrics for a total of 148 LLM performance assessments. Most assessments concerned GPT models (n = 114, 77%). The most frequently evaluated tasks were title and abstract screening (n = 78, 53%), data extraction (n = 23, 16%), and full-text screening (n = 20, 14%). For title and abstract screening, overall median PPA was 0.92 (IQR 0.69-0.98) and median NPA was 0.89 (0.72-0.95). For full-text screening, the overall median PPA was 0.93 (0.87-1.00) and median NPA was 0.92 (0.78-0.97). Late-generation LLMs released after GPT-4 seemed to provide higher performance than earlier models. For other tasks, authors reported overall good performances, but variability of performance metrics precluded complete quantitative synthesis. Global accuracy for data extraction tasks ranged from 0.36 to 1.00, with a median accuracy of 0.95 (IQR 0.91-0.97, n = 11). For the "risk of bias assessment" task, accuracy ranged from 0.44 to 0.90 (median = 0.62, IQR 0.53-0.76, n = 6). 
The performance of LLMs, particularly newer generations, shows promise in automating some repetitive steps of systematic reviews such as screening. However, their successful integration will require appropriate safeguards and careful implementation. Systematic reviews are one of the most reliable ways to answer medical and public health questions. They bring together all available studies on a topic and help clinicians and policymakers make informed decisions. However, producing a high-quality systematic review takes a lot of time and effort. Whole teams of researchers spend months screening thousands of articles, extracting data, and double-checking results. With slightly more than a million new publications every year, keeping reviews up to date is becoming increasingly difficult. LLMs, such as ChatGPT, may help reduce this workload. These tools can read and summarize text and might assist with repetitive tasks like selecting relevant studies or extracting information from articles. But it is still unclear how reliable these tools are for research purposes. This is the first systematic review to assess LLMs' performance in facilitating systematic reviews. We sought to review all studies that tested LLMs in the different steps of systematic reviews and found 63 studies evaluating how well these tools performed compared with human reviewers. Overall, LLMs showed good agreement with humans for tasks such as screening titles and abstracts, and full-text articles. Newer models seemed to perform better than older ones. However, performance was more variable for complex tasks that require interpretation, such as extracting detailed data or assessing methodological quality. Our findings suggest that LLMs could help researchers work faster and make systematic reviews more efficient. However, they are not ready to replace human judgment. These tools can make mistakes, produce inconsistent results, or generate inaccurate information if not carefully supervised. 
In practice, LLMs should be used as assistants rather than substitutes. With proper safeguards, transparent reporting, and human oversight, they may become valuable tools to support evidence-based healthcare and help keep research up to date.
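
The positive and negative percent agreement figures in this abstract can be read as sensitivity and specificity with the human reviewer as the reference. A minimal sketch of that calculation for a screening task follows. The `percent_agreement` function and the toy include/exclude decisions are illustrative assumptions, not data or code from the review.

```python
def percent_agreement(llm_decisions, human_decisions):
    """Return (PPA, NPA), treating the human decisions as the reference.

    PPA: share of human "include" decisions that the LLM also included.
    NPA: share of human "exclude" decisions that the LLM also excluded.
    """
    if len(llm_decisions) != len(human_decisions):
        raise ValueError("decision lists must have the same length")
    pairs = list(zip(llm_decisions, human_decisions))
    human_pos = sum(1 for _, h in pairs if h)
    human_neg = len(pairs) - human_pos
    both_pos = sum(1 for l, h in pairs if h and l)
    both_neg = sum(1 for l, h in pairs if not h and not l)
    # Undefined when the reference has no positives (or no negatives).
    ppa = both_pos / human_pos if human_pos else float("nan")
    npa = both_neg / human_neg if human_neg else float("nan")
    return ppa, npa


if __name__ == "__main__":
    # Toy title/abstract screening decisions (True = include).
    human = [True, True, True, False, False, False, False, False]
    llm = [True, True, False, False, False, False, False, True]
    ppa, npa = percent_agreement(llm, human)
    print(f"PPA = {ppa:.2f}, NPA = {npa:.2f}")
```

In screening, a missed relevant study (lower PPA) is usually costlier than an extra full-text check (lower NPA), which is one reason the two are reported separately.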

  • Research Article
  • Cite count: 10
  • 10.34190/iccws.19.1.2096
Enhancing Privacy and Security in Large-Language Models: A Zero-Knowledge Proof Approach
  • Mar 21, 2024
  • International Conference on Cyber Warfare and Security
  • Shridhar Singh

The explosive growth of Large-Language Models (LLMs), particularly Generative Pre-trained Transformer (GPT) models, has revolutionised fields ranging from natural language processing to creative writing. Yet, their reliance on vast, often unverified data sources introduces a critical vulnerability: unreliability and security concerns. Traditional GPT models, while impressive in their capabilities, struggle with limited factual accuracy and susceptibility to manipulation by biased or malicious data. This poses a significant risk in professional and personal environments where sensitive or mission-critical data is paramount. This work tackles this challenge head-on by proposing a novel approach to enhance GPT security and reliability: leveraging Zero-Knowledge Proofs (ZKPs). Unlike traditional cryptographic methods that require sensitive data exchange, ZKPs allow one party to convincingly prove the truth of a statement, without revealing the underlying information. In the context of GPTs, ZKPs can validate the legitimacy and quality of data sources used in GPT computations, combating data manipulation and misinformation. This ensures trustworthy outputs, even when incorporating third-party data (TPD). ZKPs can securely verify user identities and access privileges, preventing unauthorised access to sensitive data and functionality. This protects critical information and promotes responsible LLM usage. ZKPs can identify and filter out manipulative prompts designed to elicit harmful or biased responses from GPTs. This safeguards against malicious actors and promotes ethical LLM development. ZKPs facilitate training specialised GPT models on targeted datasets, resulting in deeper understanding and more accurate outputs within specific domains. This allows the creation of ‘expert-GPT’ applications in specialised fields like healthcare, finance, and legal services. 
The integration of ZKPs into GPT models represents a crucial step towards overcoming trust and security barriers. Our research demonstrates the viability and efficacy of this approach, with our ZKP-based authentication system achieving promising results in data verification, user control, and malicious prompt detection. These findings lay the groundwork for a future where GPTs, empowered by ZKPs, operate with unwavering integrity, fostering trust and accelerating ethical AI development across diverse domains.
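The abstract above describes ZKPs only at a high level. As a rough illustration of the core idea (proving knowledge of a secret without revealing it), the following is a minimal sketch of one round of Schnorr identification, a standard textbook protocol. It is not the authors' system: the group parameters `P`, `Q`, `G` and all function names are assumptions chosen for illustration, and the parameters are far too small for real use.

```python
import secrets

# Toy group parameters (assumed for illustration; far too small for real use).
P = 23  # prime modulus
Q = 11  # prime order of the subgroup generated by G
G = 2   # generator of that subgroup: pow(2, 11, 23) == 1


def keygen(secret_x):
    """Public key y = G^x mod P; the secret x is never sent."""
    return pow(G, secret_x, P)


def commit(nonce_r):
    """Prover's first message: commitment t = G^r mod P."""
    return pow(G, nonce_r, P)


def respond(secret_x, nonce_r, challenge_c):
    """Prover's answer to the challenge: s = r + c*x mod Q."""
    return (nonce_r + challenge_c * secret_x) % Q


def verify(public_y, commitment_t, challenge_c, response_s):
    """Verifier checks G^s == t * y^c mod P using public values only."""
    left = pow(G, response_s, P)
    right = (commitment_t * pow(public_y, challenge_c, P)) % P
    return left == right


def run_round(secret_x):
    """One honest round with a fresh random nonce and challenge."""
    y = keygen(secret_x)
    r = secrets.randbelow(Q - 1) + 1  # prover's one-time nonce in [1, Q-1]
    t = commit(r)
    c = secrets.randbelow(Q)          # verifier's random challenge in [0, Q-1]
    s = respond(secret_x, r, c)
    return verify(y, t, c, s)
```

The verifier sees only `y`, `t`, `c`, and `s`. Because the nonce `r` is fresh and random each round, `s` on its own does not disclose `secret_x`, yet a prover who does not know the secret cannot reliably answer a random challenge.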

  • Research Article
  • Citations: 124
  • 10.1001/jamanetworkopen.2024.8895
Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department
  • May 7, 2024
  • JAMA Network Open
  • Christopher Y K Williams + 6 more

The introduction of large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4; OpenAI), has generated significant interest in health care, yet studies evaluating their performance in a clinical setting are lacking. Determination of clinical acuity, a measure of a patient's illness severity and level of required medical attention, is one of the foundational elements of medical reasoning in emergency medicine. The objective of this study was to determine whether an LLM can accurately assess clinical acuity in the emergency department (ED). This cross-sectional study identified all adult ED visits from January 1, 2012, to January 17, 2023, at the University of California, San Francisco, with a documented Emergency Severity Index (ESI) acuity level (immediate, emergent, urgent, less urgent, or nonurgent) and with a corresponding ED physician note. A sample of 10 000 pairs of ED visits with nonequivalent ESI scores, balanced for each of the 10 possible pairs of 5 ESI scores, was selected at random. The exposure of interest was the potential of the LLM to classify acuity levels of patients in the ED based on the ESI across 10 000 patient pairs. Using deidentified clinical text, the LLM was queried to identify the patient with a higher-acuity presentation within each pair based on the patients' clinical history. An earlier LLM was queried to allow comparison with this model. An accuracy score was calculated to evaluate the performance of both LLMs across the 10 000-pair sample. A 500-pair subsample was manually classified by a physician reviewer to compare performance between the LLMs and human classification. From a total of 251 401 adult ED visits, a balanced sample of 10 000 patient pairs was created wherein each pair comprised patients with disparate ESI acuity scores. Across this sample, the LLM correctly inferred the patient with higher acuity for 8940 of 10 000 pairs (accuracy, 0.89 [95% CI, 0.89-0.90]). Performance of the comparator LLM (accuracy, 0.84 [95% CI, 0.83-0.84]) was below that of its successor. Among the 500-pair subsample that was also manually classified, LLM performance (accuracy, 0.88 [95% CI, 0.86-0.91]) was comparable with that of the physician reviewer (accuracy, 0.86 [95% CI, 0.83-0.89]). In this cross-sectional study of 10 000 pairs of ED visits, the LLM accurately identified the patient with higher acuity when given pairs of presenting histories extracted from patients' first ED documentation. These findings suggest that the integration of an LLM into ED workflows could enhance triage processes while maintaining triage quality; this possibility warrants further investigation.
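The headline accuracy and its interval can be reproduced from the counts given in the abstract. The sketch below assumes a normal-approximation (Wald) interval; the abstract does not say which interval method the authors used, so that choice, along with the function and variable names, is an assumption made for illustration.

```python
import math


def accuracy_with_ci(correct, total, z=1.96):
    """Accuracy with a normal-approximation (Wald) 95% CI.

    The Wald method is an assumption; the abstract does not name
    the interval method actually used in the study.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Counts reported in the abstract: 8940 correct out of 10 000 pairs.
acc, lower, upper = accuracy_with_ci(8940, 10_000)

# Five ESI levels give C(5, 2) unordered pairs of nonequivalent scores.
n_pair_types = math.comb(5, 2)

# If the sample is split evenly across pair types (an inference from
# "balanced"), each pair type contributes this many pairs.
pairs_per_type = 10_000 // n_pair_types

print(f"accuracy={acc:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
print(f"pair types={n_pair_types}, pairs per type={pairs_per_type}")
```

Rounded to two decimals, these values agree with the reported accuracy of 0.89 (95% CI, 0.89-0.90), and the pair count matches the "10 possible pairs of 5 ESI scores" stated in the abstract.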

  • Research Article
  • 10.7759/cureus.99472
Evaluating Multimodal Large Language Model (LLM) (Generative Pre-trained Transformer 5 (GPT-5)) for Meniscal Tear Detection on Knee Magnetic Resonance Imaging (MRI): A Pilot Study
  • Dec 17, 2025
  • Cureus
  • Kwan Kit Chan + 2 more

Introduction: Magnetic resonance imaging (MRI) of the knee is the gold standard for evaluating meniscal injuries. While specialized artificial intelligence (AI) models have demonstrated high diagnostic capability in detecting meniscal tears, the performance of general-purpose large language models (LLMs) with multimodal vision capabilities remains underexplored. Previous iterations, such as generative pre-trained transformer 4 (GPT-4) (OpenAI, San Francisco, CA, USA) with vision, have shown limited success in direct musculoskeletal image interpretation. This study evaluates the diagnostic performance of the latest-generation LLM, generative pre-trained transformer 5 (GPT-5), in detecting meniscal tears on knee MRI.
Objectives: This study aimed to evaluate the diagnostic performance of GPT-5 (a general-purpose multimodal LLM) in detecting meniscal tears on knee MRI in a zero-shot setting, using a publicly available dataset.
Materials and methods: One hundred knee MRI examinations (50 with meniscal tears, 50 without) were randomly selected from the MRNet validation dataset, with ground-truth labels extracted from the dataset. Sagittal T2-weighted and coronal T1-weighted series were reviewed for completeness and image quality and then converted to Portable Network Graphics (PNG) slices. GPT-5 (gpt-5-2025-08-07) analyzed each case in zero-shot fashion using a fixed prompt requesting a binary ("yes/no") determination of meniscal tear presence without any clinical context. Model predictions were compared with ground truth, and accuracy, precision, recall, specificity, and F1-scores were calculated with 95% confidence intervals.
Results: GPT-5 achieved an overall accuracy of 76% (95% CI: 0.668-0.833). The model demonstrated a sensitivity (recall) of 84% (95% CI: 0.715-0.917) and a specificity of 68% (95% CI: 0.542-0.792). The precision for detecting tears was 72.4%, and the F1-score was 0.778.
Conclusion: In this pilot study, GPT-5 demonstrates potential in the zero-shot interpretation of knee MRIs for meniscal tear detection, outperforming previous multimodal LLMs. However, the results should be interpreted with caution due to study limitations, and clinical utility is currently limited by a high false-positive rate and lack of visual explainability. Nevertheless, this pilot evaluation provides an initial proof of concept, and with larger datasets, rigorous validation, improved calibration, and enhanced explainability, future multimodal LLMs may evolve into supportive, human-in-the-loop tools in musculoskeletal radiology.
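The reported metrics for the knee MRI study are mutually consistent, which can be checked with a short calculation. The confusion-matrix counts below are not stated in the abstract; they are inferred from the reported sensitivity (84% of 50 tear cases) and specificity (68% of 50 non-tear cases), and the function name is an assumption for illustration.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Counts inferred (not reported directly) from the abstract:
# 84% of 50 tear cases -> 42 true positives, 8 false negatives;
# 68% of 50 non-tear cases -> 34 true negatives, 16 false positives.
metrics = diagnostic_metrics(tp=42, fn=8, tn=34, fp=16)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these inferred counts, 16 of the 58 positive calls are false positives, which is the high false-positive rate the conclusion refers to.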
