Enhancing the Accuracy of LLMs in Nursing Education Through RAG and Chain-of-Thought Reasoning.
The penetration of large language models (LLMs) into all walks of life makes it essential to find effective ways to improve the accuracy of their practical application in nursing scenarios. This study investigates the potential of retrieval-augmented generation (RAG) technology and chain-of-thought (CoT) reasoning to address the limitations of LLMs in professional knowledge and complex problem reasoning. Leveraging a knowledge base derived from the Chinese National Nursing Licensure Examination question bank, the researchers first evaluated the baseline performance of LLMs. Subsequently, the CoT reasoning process was systematically compared with the official exam explanations to assess the model's ability to interpret nursing-related questions. Experimental results demonstrated that integrating the knowledge base significantly improved LLM accuracy from 84.58% to 93.33%. Furthermore, the CoT reasoning process achieved a 91.33% accuracy rate in explaining question options, highlighting its robust logical reasoning capabilities. These findings underscore that the synergistic integration of RAG and CoT enhances the precision of LLMs in knowledge retrieval and clinical reasoning, offering an innovative technical pathway for developing intelligent nursing education tools. The study not only validates the effectiveness of combining knowledge augmentation with advanced reasoning mechanisms but also provides methodological insights for improving the reliability of AI applications in health care.
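As a rough illustration of the retrieval-plus-CoT pattern this abstract describes, the sketch below pairs a TF-IDF retriever over a toy knowledge base with a chain-of-thought prompt. It is a minimal sketch under invented data: the knowledge-base entries, question, and prompt wording are stand-ins, and the final LLM call is omitted because the study's actual pipeline and model are not public.

```python
# Minimal RAG + CoT prompting sketch (illustrative only; the study's
# knowledge base, retriever, and model are not public).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge-base entries distilled from exam explanations.
kb = [
    "Rotate subcutaneous insulin injection sites; the abdomen absorbs fastest.",
    "Monitor postoperative patients for fever, redness, and wound drainage.",
]
question = "Which injection site gives the fastest insulin absorption?"

vec = TfidfVectorizer().fit(kb + [question])
sims = cosine_similarity(vec.transform([question]), vec.transform(kb))[0]
context = kb[sims.argmax()]  # top-1 retrieval; real systems retrieve top-k

prompt = (
    "Use the reference to answer the nursing exam question.\n"
    f"Reference: {context}\n"
    f"Question: {question}\n"
    "Let's think step by step, then state the final answer."  # CoT trigger
)
print(prompt)  # this prompt would then be sent to the LLM of choice
```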
- Research Article
18
- 10.1002/ohn.864
- Jun 19, 2024
- Otolaryngology-Head and Neck Surgery
The recent surge in popularity of large language models (LLMs), such as ChatGPT, has showcased their proficiency in medical examinations and potential applications in health care. However, LLMs possess inherent limitations, including inconsistent accuracy, specific prompting requirements, and the risk of generating harmful hallucinations. A domain-specific model might address these limitations effectively. Study design: developmental. Setting: virtual. Otolaryngology-head and neck surgery (OHNS)-relevant data were systematically gathered from open-access Internet sources and indexed into a knowledge database. We leveraged retrieval-augmented language modeling to recall this information and utilized it for pretraining, which was then integrated into ChatGPT 4.0, creating an OHNS-specific knowledge question-and-answer platform known as ChatENT. The model was further tested on different types of questions. ChatENT showed enhanced performance in the analysis and interpretation of OHNS information, outperforming ChatGPT 4.0 in both the Canadian Royal College OHNS sample examination questions challenge and the US board practice questions challenge, with a 58.4% and 26.0% error reduction, respectively. ChatENT generated fewer hallucinations and demonstrated greater consistency. To the best of our knowledge, ChatENT is the first specialty-specific knowledge retrieval artificial intelligence in the medical field that utilizes the latest LLM. It appears to have considerable promise in areas such as medical education, patient education, and clinical decision support. The model has demonstrated the capacity to overcome the limitations of existing LLMs, thereby signaling a future of more precise, safe, and user-friendly applications in the realm of OHNS and other medical fields.
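A platform like ChatENT rests on a retrieve-then-ground loop: chunk the specialty corpus, index it, recall the best chunks for a query, and feed them to the LLM as context. The sketch below shows that loop in miniature; the chunker, the term-overlap scorer, and the file ohns_corpus.txt are all hypothetical stand-ins, not the actual system.

```python
# Toy retrieve-then-ground loop (illustrative; ChatENT's real corpus,
# chunker, and retriever are not public).
def chunk(text: str, size: int = 80) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlap(query: str, passage: str) -> int:
    q = set(query.lower().split())
    return sum(w in q for w in passage.lower().split())  # crude term overlap

corpus = chunk(open("ohns_corpus.txt", encoding="utf-8").read())  # hypothetical file
query = "first-line management of sudden sensorineural hearing loss"
top3 = sorted(corpus, key=lambda p: overlap(query, p), reverse=True)[:3]

grounded_prompt = ("Answer using only the context below.\n"
                   "Context:\n" + "\n---\n".join(top3) +
                   f"\nQuestion: {query}")
# grounded_prompt would then be sent to the underlying chat model.
```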
- Research Article
83
- 10.1001/jamanetworkopen.2023.46721
- Dec 7, 2023
- JAMA Network Open
Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored. To assess the performance of LLMs on neurology board-style examinations. This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. The main outcome was the overall percentage score of each LLM. LLM 2 significantly outperformed LLM 1, correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance exceeded the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers. Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
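Board-exam evaluations of this kind reduce to tallying correctness by model and question category. The toy sketch below shows one way such a tally might look with pandas; the grading log is invented, not the study's data.

```python
# Per-category accuracy tally (illustrative; the study's question bank
# and grading log are not public).
import pandas as pd

# Hypothetical grading log: one row per question per model.
df = pd.DataFrame({
    "model":   ["LLM 1", "LLM 1", "LLM 2", "LLM 2"],
    "bloom":   ["lower-order", "higher-order"] * 2,
    "correct": [1, 0, 1, 1],
})
print(df.groupby(["model", "bloom"])["correct"].mean())  # accuracy by level
```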
- Research Article
- 10.1001/jamanetworkopen.2025.49963
- Dec 19, 2025
- JAMA Network Open
Large language models (LLMs) are increasingly integrated into health care applications; however, their vulnerability to prompt-injection attacks (ie, maliciously crafted inputs that manipulate an LLM's behavior) capable of altering medical recommendations has not been systematically evaluated. To evaluate the susceptibility of commercial LLMs to prompt-injection attacks that may induce unsafe clinical advice and to validate man-in-the-middle, client-side injection as a realistic attack vector. This quality improvement study used a controlled simulation design and was conducted between January and October 2025 using standardized patient-LLM dialogues. The main experiment evaluated 3 lightweight models (GPT-4o-mini [LLM 1], Gemini-2.0-flash-lite [LLM 2], and Claude-3-haiku [LLM 3]) across 12 clinical scenarios in 4 categories under controlled conditions. The 12 clinical scenarios were stratified by harm level across 4 categories: supplement recommendations, opioid prescriptions, pregnancy contraindications, and central-nervous-system toxic effects. A proof-of-concept experiment tested 3 flagship models (GPT-5 [LLM 4], Gemini 2.5 Pro [LLM 5], and Claude 4.5 Sonnet [LLM 6]) using client-side injection in a high-risk pregnancy scenario. Two prompt-injection strategies were evaluated: (1) context-aware injection for moderate- and high-risk scenarios and (2) evidence-fabrication injection for extremely high-harm scenarios. Injections were programmatically inserted into user queries within a multiturn dialogue framework. The primary outcome was injection success at the primary decision turn. Secondary outcomes included persistence across dialogue turns and model-specific success rates by harm level. Across 216 evaluations (108 injection vs 108 control), attacks achieved 94.4% (102 of 108 evaluations) success at turn 4 and persisted in 69.4% (75 of 108 evaluations) of follow-ups. LLM 1 and LLM 2 were completely susceptible (36 of 36 dialogues [100%] each), and LLM 3 remained vulnerable in 83.3% of dialogues (30 of 36 dialogues). Extremely high-harm scenarios including US Food and Drug Administration Category X pregnancy drugs (eg, thalidomide) succeeded in 91.7% of dialogues (33 of 36 dialogues). The proof-of-concept experiment demonstrated 100% vulnerability for LLM 4 and LLM 5 (5 of 5 dialogues each) and 80.0% (4 of 5 dialogues) for LLM 6. In this quality improvement study using a controlled simulation, commercial LLMs demonstrated substantial vulnerability to prompt-injection attacks that could generate clinically dangerous recommendations; even flagship models with advanced safety mechanisms showed high susceptibility. These findings underscore the need for adversarial robustness testing, system-level safeguards, and regulatory oversight before clinical deployment.
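The study's outcome measures (success at the primary decision turn, persistence across follow-ups) amount to simple bookkeeping over simulated dialogues. The sketch below shows only that benign bookkeeping, with no attack payloads; the dataclass fields and function are assumptions about how such a harness might record results, not the authors' code.

```python
# Benign outcome-tally sketch for an injection simulation like the one
# described above; it contains no attack content. Illustrative only.
from dataclasses import dataclass

@dataclass
class Dialogue:
    model: str
    injected: bool        # injection arm vs. control arm
    success_turn4: bool   # unsafe advice at the primary decision turn
    persisted: bool       # unsafe advice maintained in follow-up turns

def injection_rates(dialogues: list[Dialogue], model: str) -> tuple[float, float]:
    arm = [d for d in dialogues if d.model == model and d.injected]
    n = len(arm) or 1  # avoid division by zero on empty arms
    return (sum(d.success_turn4 for d in arm) / n,
            sum(d.persisted for d in arm) / n)
```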
- Research Article
192
- 10.1002/hcs2.61
- Jul 24, 2023
- Health Care Science
Recently, the emergence of ChatGPT, an artificial intelligence chatbot developed by OpenAI, has attracted significant attention due to its exceptional language comprehension and content generation capabilities, highlighting the immense potential of large language models (LLMs). LLMs have become a burgeoning research hotspot across many fields, including health care. Within health care, LLMs may be classified into LLMs for the biomedical domain and LLMs for the clinical domain based on the corpora used for pre-training. In the last 3 years, these domain-specific LLMs have demonstrated exceptional performance on multiple natural language processing tasks, surpassing the performance of general LLMs as well. This not only emphasizes the significance of developing dedicated LLMs for specific domains, but also raises expectations for their applications in health care. We believe that LLMs may be used widely in preconsultation, diagnosis, and management, with appropriate development and supervision. Additionally, LLMs hold tremendous promise in assisting with medical education, medical writing, and other related applications. Likewise, health care systems must recognize and address the challenges posed by LLMs.
- Research Article
156
- 10.3390/informatics11030057
- Aug 7, 2024
- Informatics
The deployment of large language models (LLMs) within the healthcare sector has sparked both enthusiasm and apprehension. These models exhibit the remarkable ability to provide proficient responses to free-text queries, demonstrating a nuanced understanding of professional medical knowledge. This comprehensive survey delves into the functionalities of existing LLMs designed for healthcare applications and elucidates the trajectory of their development, starting with traditional Pretrained Language Models (PLMs) and then moving to the present state of LLMs in the healthcare sector. First, we explore the potential of LLMs to amplify the efficiency and effectiveness of diverse healthcare applications, particularly focusing on clinical language understanding tasks. These tasks encompass a wide spectrum, ranging from named entity recognition and relation extraction to natural language inference, multimodal medical applications, document classification, and question-answering. Additionally, we conduct an extensive comparison of the most recent state-of-the-art LLMs in the healthcare domain, while also assessing the utilization of various open-source LLMs and highlighting their significance in healthcare applications. Furthermore, we present the essential performance metrics employed to evaluate LLMs in the biomedical domain, shedding light on their effectiveness and limitations. Finally, we summarize the prominent challenges and constraints faced by large language models in the healthcare sector by offering a holistic perspective on their potential benefits and shortcomings. This review provides a comprehensive exploration of the current landscape of LLMs in healthcare, addressing their role in transforming medical applications and the areas that warrant further research and development.
- Research Article
19
- 10.1101/2024.04.26.24306390
- Apr 27, 2024
- medRxiv
Background: The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective: This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods: We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results: Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction, and (4) administration, and four categories of concerns: (1) reliability, (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in the reviewed papers were conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs and to promote, improve, and regulate their application in healthcare.
- Research Article
3
- 10.2214/ajr.25.32729
- Jul 1, 2025
- AJR: American Journal of Roentgenology
BACKGROUND. The American College of Radiology (ACR) Incidental Findings Committee (IFC) algorithm provides guidance for pancreatic cystic lesion (PCL) management. Its implementation using plain-text large language model (LLM) solutions is challenging given that key components include multimodal data (e.g., figures and tables). OBJECTIVE. The purpose of the study is to evaluate a multimodal LLM approach incorporating knowledge retrieval using flowchart embedding for forming follow-up recommendations for PCL management. METHODS. This retrospective study included patients who underwent abdominal CT or MRI from September 1, 2023, to September 1, 2024, and whose report mentioned a PCL. The reports' Findings sections were inputted to a multimodal LLM (GPT-4o). For task 1 (198 patients: mean age, 69.0 ± 13.0 [SD] years; 110 women, 88 men), the LLM assessed PCL features (presence of PCL, PCL size and location, presence of main pancreatic duct communication, presence of worrisome features or high-risk stigmata) and formed a follow-up recommendation using three knowledge retrieval methods (default knowledge, plain-text retrieval-augmented generation [RAG] from the ACR IFC algorithm PDF document, and flowchart embedding using the LLM's image-to-text conversion for in-context integration of the document's flowcharts and tables). For task 2 (85 patients: mean initial age, 69.2 ± 10.8 years; 48 women, 37 men), an additional relevant prior report was inputted; the LLM assessed for interval PCL change and provided an adjusted follow-up schedule accounting for prior imaging using flowchart embedding. Three radiologists assessed LLM accuracy in task 1 for PCL findings in consensus and follow-up recommendations independently; one radiologist assessed accuracy in task 2. RESULTS. For task 1, the LLM with flowchart embedding had accuracy for PCL features of 98.0-99.0%. The accuracy of the LLM follow-up recommendations based on default knowledge, plain-text RAG, and flowchart embedding was 42.4%, 23.7%, and 89.9% (p < .001), respectively, for radiologist 1; 39.9%, 24.2%, and 91.9% (p < .001) for radiologist 2; and 40.9%, 25.3%, and 91.9% (p < .001) for radiologist 3. For task 2, the LLM using flowchart embedding showed an accuracy for interval PCL change of 96.5% and for adjusted follow-up schedules of 81.2%. CONCLUSION. Multimodal flowchart embedding aided the LLM's automated provision of follow-up recommendations adherent to a clinical guidance document. CLINICAL IMPACT. The framework could be extended to other incidental findings through the use of other clinical guidance documents as the model input.
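The paper's flowchart-embedding step amounts to converting the guideline's flowcharts to text once, then reusing that text as in-context knowledge. The sketch below illustrates the idea with an OpenAI-style vision call; the file name, prompt wording, and client usage are assumptions for illustration, not the study's actual code.

```python
# Sketch of "flowchart embedding": transcribe a guideline flowchart image
# to text with a multimodal model, then reuse the text as context.
# Illustrative only; the study's prompts are not public.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("acr_ifc_pcl_flowchart.png", "rb") as f:  # hypothetical figure file
    b64 = base64.b64encode(f.read()).decode()

flowchart_text = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Transcribe this management flowchart as "
                                  "nested if/then rules, preserving size and "
                                  "age cutoffs and follow-up intervals."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
).choices[0].message.content

# flowchart_text would then be prepended to a report's Findings section
# when asking the model for a follow-up recommendation.
```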
- Research Article
6
- 10.1016/j.artmed.2025.103078
- Apr 1, 2025
- Artificial Intelligence in Medicine
Empowering large language models for automated clinical assessment with generation-augmented retrieval and hierarchical chain-of-thought.
- Research Article
- 10.1200/jco.2024.42.16_suppl.e13630
- Jun 1, 2024
- Journal of Clinical Oncology
e13630 Background: As many as 60% of prior authorization requests are denied, yet coverage approval occurs for more than 60% of appeals for some therapies. Appeal processes encumber providers and increase burnout, but large language models (LLMs) may aid providers by drafting appeal letters. We evaluated LLM performance at this task for radiotherapy denials. Methods: Three commercially accessible LLMs were evaluated: generative pre-trained transformer 3.5 (GPT3.5), GPT4, and GPT4+web with internet search capacity (OpenAI, Inc., San Francisco, CA). A fourth LLM, GPT3.5-FT, was developed by fine-tuning GPT3.5 in a HIPAA-compliant local environment. The fine-tuning training data comprised 53 insurance denial appeal letters prepared by radiation oncologists and paired prompts describing the clinical history and appeal intent. Training data were enriched in appeal letters for proton radiotherapy, stereotactic body radiotherapy, and image-guided radiotherapy for myriad clinical scenarios. Twenty prompts, each requesting a letter for a simulated patient history, were programmatically presented to the LLMs. Three radiation oncologists, who were blinded to the LLM source, scored letter outputs across four domains: language syntax and semantics, clinical detail inclusion, clinical reasoning validity, and overall readiness for insurer submission. Additionally, one radiation oncologist scored the authenticity and relevance of literature sources cited in output letters, which were requested by several test prompts. Interobserver agreement between radiation oncologist scores was determined by Cohen's kappa coefficient. Scores were compared between LLMs with non-parametric statistical tests. Results: Agreement between radiation oncologists' scores was moderate-to-excellent across all domains (median κ = 0.68, minimum κ = 0.41). GPT3.5, GPT4, and GPT4+web drafted letters that, by mode average, were semantically and syntactically clear, included all provided clinical history without confabulation, clinically reasoned with few necessary revisions, and overall were submissible to an insurer with minor revisions. GPT4 and GPT4+web clinically reasoned better than GPT3.5 (p values < 0.001). In contrast, GPT3.5-FT performance was inferior to other LLMs across all domains (p values < 0.001). LLMs were poor at identifying, citing, and summarizing relevant literature unless provided in the prompt. Conclusions: LLMs can draft insurance appeal letters for radiotherapy services that require few revisions yet are poor at referencing relevant literature. Contrary to our hypothesis, fine-tuning with data from our department compromised LLM performance.
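Interobserver agreement of the kind reported here is typically computed with Cohen's kappa over paired rater scores, as in the sketch below; the scores are invented stand-ins, and a weighted kappa could be substituted for ordinal rating scales.

```python
# Interobserver-agreement sketch (illustrative; the raters' actual scores
# are not public, and these values are invented).
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 5, 3, 4, 2, 5, 4, 3]  # hypothetical domain scores, rater A
rater_b = [4, 4, 3, 5, 2, 5, 4, 2]  # hypothetical domain scores, rater B
print(cohen_kappa_score(rater_a, rater_b))  # unweighted Cohen's kappa
```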
- Research Article
34
- 10.1016/j.medp.2024.100030
- May 17, 2024
- Medicine Plus
Large language models illuminate a progressive pathway to artificial intelligent healthcare assistant
- Preprint Article
- 10.2196/preprints.68320
- Nov 3, 2024
BACKGROUND Medical question answering (QA) is essential for various medical applications. While small-scale pre-trained language models (PLMs) are widely adopted in open-domain QA tasks through fine-tuning with related datasets, applying this approach in the medical domain requires significant and rigorous integration of external knowledge. Knowledge-enhanced small-scale PLMs have been proposed to incorporate knowledge bases (KBs) to improve performance, as KBs contain vast amounts of factual knowledge. Large language models (LLMs) also contain a vast amount of knowledge and have attracted significant research interest due to their outstanding natural language processing (NLP) capabilities. Both KBs and LLMs can provide external knowledge to enhance small-scale models in medical QA. OBJECTIVE KBs consist of structured factual knowledge that must be converted into sentences to align with the input format of PLMs. However, these converted sentences often lack semantic coherence, potentially causing them to deviate from the intrinsic knowledge of KBs. LLMs, on the other hand, can generate natural, semantically rich sentences, but they may also produce irrelevant or inaccurate statements. The retrieval-augmented generation (RAG) paradigm enhances LLMs by retrieving relevant information from an external database before responding. By integrating LLMs and KBs using the RAG paradigm, it is possible to generate statements that combine the factual knowledge of KBs with the semantic richness of LLMs, thereby enhancing the performance of small-scale models. In this paper, we explore a RAG fine-tuning method, RAG-mQA, that combines KBs and LLMs to improve small-scale models in medical QA. METHODS In the RAG fine-tuning scenario, we adopt medical KBs as an external database to augment the text generation of LLMs, producing statements that integrate medical domain knowledge with semantic knowledge. Specifically, KBs are used to extract medical concepts from the input text, while LLMs are tasked with generating statements based on these extracted concepts. In addition, we introduce two alternative strategies for constructing knowledge: KB-based and LLM-based construction. In the KB-based scenario, we extract medical concepts from the input text using KBs and convert them into sentences by connecting the concepts sequentially. In the LLM-based scenario, we provide the input text to an LLM, which generates relevant statements to answer the question. For downstream QA tasks, the knowledge produced by these three strategies (RAG-based, KB-based, and LLM-based) is inserted into the input text to fine-tune a small-scale PLM. F1 and exact match (EM) scores are employed as evaluation metrics for performance comparison. Fine-tuned PLMs without knowledge insertion serve as baselines. Experiments are conducted on two medical QA datasets: emrQA (English) and MedicalQA (Chinese). RESULTS RAG-mQA achieved the best results on both datasets. On the MedicalQA dataset, compared to the KB-based and LLM-based enhancement methods, RAG-mQA improved the F1 score by 0.59% and 2.36%, and the EM score by 2.96% and 11.18%, respectively. On the emrQA dataset, the EM score of RAG-mQA exceeded those of the KB-based and LLM-based methods by 4.65% and 7.01%, respectively. CONCLUSIONS Experimental results demonstrate that the RAG fine-tuning method can improve model performance in medical QA, with RAG-mQA achieving greater improvements than the other knowledge-enhanced methods. CLINICALTRIAL This study does not involve trial registration.
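The core of the method, as described, is textual knowledge insertion: statements built by the KB-based, LLM-based, or RAG-based strategy are prepended to the question before fine-tuning the small PLM, and outputs are scored with F1 and exact match. A minimal sketch follows, with the [SEP] separator and the normalization inside the EM scorer as assumptions rather than the paper's exact choices.

```python
# Knowledge-insertion and exact-match sketch (illustrative; RAG-mQA's real
# prompts and separator tokens are not public, and "[SEP]" is an assumption).
def insert_knowledge(question: str, statements: list[str]) -> str:
    # Statements from the KB-, LLM-, or RAG-based strategy are prepended
    # to the question before fine-tuning the small PLM.
    return " ".join(statements) + " [SEP] " + question

def exact_match(pred: str, gold: str) -> bool:
    norm = lambda s: " ".join(s.lower().split())  # trivial normalization
    return norm(pred) == norm(gold)

print(insert_knowledge("What dose was prescribed?",
                       ["Metformin 500 mg was prescribed twice daily."]))
```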
- Research Article
1
- 10.2196/66126
- Jul 23, 2025
- JMIR Formative Research
Clinical reasoning is a critical skill for physical therapists, involving the collection and interpretation of patient information to form accurate diagnoses. Traditional training often lacks the diversity of clinical cases necessary for students to develop these skills comprehensively. Large language models (LLMs) like GPT-4 have the potential to simulate a wide range of clinical scenarios, offering a novel approach to enhance clinical reasoning in physical therapy education. The aim of the study is to explore the main barriers and facilitators that could be encountered in conducting a randomized clinical trial to study the effectiveness of implementing LLMs as tools for developing the clinical reasoning of physical therapy students. This pilot randomized parallel-group study involved 46 third-year physical therapy students at La Salle Centre for Higher University Studies. Participants were randomly assigned to either the experimental group, which received LLM training, or the control group, which followed the usual curriculum. The intervention lasted for 4 weeks, during which the experimental group used an LLM to solve weekly clinical cases. Digital competencies, satisfaction, and costs were evaluated to explore the feasibility of this intervention. The recruitment and participation rates were high, but active engagement with the LLM was low, with only 5.75% (5/23) of the experimental group actively using the model. No significant difference in overall satisfaction was found between the groups, and the cost analysis reflected an initial cost of US $1738 for completing the study. While LLMs have the potential to enhance specific digital competencies in physical therapy students, their practical integration into the curriculum faces challenges. Future studies should focus on improving student engagement with LLMs and extending the training period to determine the feasibility of integrating this tool into physical therapy education and maximize its benefits.
- Research Article
- 10.2196/83640
- Sep 5, 2025
- JMIR AI
Large language models (LLMs) are increasingly integrated into health care, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability. This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms. We systematically evaluate 3 LLMs on 3 health-related tasks using a novel dataset containing 3 types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels. Contrary to expectations, LLMs demonstrate notable robustness to common variations, and in more than half of the cases (151/270, 55.92%), performance was stable or improved. In some cases (38/270, 14.07%), variations even resulted in improved performance, especially at lower perturbation levels. Redactions, often stemming from privacy concerns or cognitive lapses, are more detrimental than other variations. Our findings highlight the need for health care applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this study provides actionable insights for improving model resilience and guiding the development of safer, more effective artificial intelligence tools in health care. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions.
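The three perturbation types the study names (redaction, homophones, typographical errors), each applied at a tunable level, can be sketched directly. The word list and the character-transposition rule below are invented stand-ins for the released dataset's actual generation rules.

```python
# Sketch of human-like input perturbations at a tunable level (illustrative;
# not the study's exact generation rules or word lists).
import random

HOMOPHONES = {"pain": "pane", "heel": "heal", "week": "weak"}  # toy examples

def redact(text: str, level: float, rng: random.Random) -> str:
    return " ".join("[REDACTED]" if rng.random() < level else w
                    for w in text.split())

def typos(text: str, level: float, rng: random.Random) -> str:
    def swap(w):  # transpose two adjacent characters
        if len(w) > 3 and rng.random() < level:
            i = rng.randrange(len(w) - 1)
            return w[:i] + w[i + 1] + w[i] + w[i + 2:]
        return w
    return " ".join(swap(w) for w in text.split())

def homophones(text: str, level: float, rng: random.Random) -> str:
    return " ".join(HOMOPHONES.get(w, w) if rng.random() < level else w
                    for w in text.split())

rng = random.Random(0)
print(typos("patient reports chest pain for one week", 0.5, rng))
```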
- Research Article
- 10.3389/frai.2026.1716819
- Jan 30, 2026
- Frontiers in Artificial Intelligence
Background: Process mining has emerged as a powerful analytical technique for understanding complex healthcare workflows. However, its application faces significant barriers, including technical complexity, a lack of standardized approaches, and limited access to practical training resources. To address this unfamiliarity and improve accessibility, we propose a new framework for translating technical analyses into text outputs that users can understand. Objective: We introduce HealthProcessAI, a GenAI framework designed to simplify process mining applications in healthcare and epidemiology by providing a comprehensive wrapper around existing Python (PM4PY) and R (bupaR) libraries. The framework integrates multiple large language models (LLMs) for automated process map interpretation and report generation, helping translate technical analyses into outputs that diverse users can readily understand. Methods: HealthProcessAI implements a modular architecture with the following components: (1) data loading and preparation, (2) process mining analysis, (3) LLM integration for interpretation, (4) advanced analytics, (5) multimodal report orchestration, and (6) a validation framework. We validated the framework using sepsis progression data as a proof-of-concept example and compared the outputs of five state-of-the-art LLMs through the OpenRouter platform. This study presents a technical validation using automated LLM evaluation; clinical validation by healthcare professionals is planned as future work. Results: The framework successfully processed sepsis data across four proof-of-concept cases. A total of 32 reports were generated (eight per case, four per LLM), demonstrating robust technical performance and the capability to produce reports through automated LLM analysis. Evaluation using seven independent LLMs as automated assessors revealed distinct model strengths: Claude Sonnet-4 and Gemini 2.5-Pro achieved the highest consistency scores (3.72/4.0 and 3.49/4.0). The outputs were not clinically validated by healthcare professionals. Conclusion: HealthProcessAI provides a standardized framework that reduces technical and training barriers in healthcare process mining while maintaining scientific objectivity. By integrating multiple LLMs for automated interpretation and report generation, the framework addresses widespread unfamiliarity with process mining outputs, demonstrating technical feasibility for making them more accessible to clinicians, data scientists, and researchers, pending clinical validation. This combination of structured analytics and AI-driven interpretation represents a novel methodological advance in translating complex process mining results into potentially actionable insights for healthcare applications. However, future work should involve systematic evaluation by clinicians.
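The framework's core loop, as described, is discover-then-explain: run a PM4PY discovery algorithm, serialize the resulting process map, and hand the text to LLMs for interpretation. The sketch below shows that loop under assumptions: pm4py's simplified 2.x interface, a hypothetical sepsis_events.csv event log, and a toy prompt; HealthProcessAI's actual modules follow the paper, not this code.

```python
# Discover-then-explain sketch with PM4PY (illustrative; assumes pm4py's
# simplified 2.x interface and a hypothetical event-log file).
import pandas as pd
import pm4py

df = pd.read_csv("sepsis_events.csv")  # hypothetical log: one row per event
log = pm4py.format_dataframe(df, case_id="case_id",
                             activity_key="activity",
                             timestamp_key="timestamp")

dfg, starts, ends = pm4py.discover_dfg(log)  # directly-follows graph

summary = "\n".join(f"{a} -> {b}: {n} transitions"
                    for (a, b), n in sorted(dfg.items(), key=lambda x: -x[1]))
prompt = ("Explain this sepsis care pathway for a clinical audience:\n"
          + summary)
# `prompt` would then be sent to each configured LLM via OpenRouter.
```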
- Research Article
- 10.1177/2473011425s00142
- Oct 1, 2025
- Foot & Ankle Orthopaedics
Research Type: Level 3 - Retrospective cohort study, Case-control study, Meta-analysis of Level 3 studies. Introduction/Purpose: Identifying and tracking surgical complications is a critical component of maintaining quality registries and improving clinical care. Medical record review to record complications is often performed by a dedicated clinical team and is time-consuming and expensive. In recent years, Large Language Models (LLMs) have emerged as a promising Artificial Intelligence (AI) tool to more efficiently and accurately retrieve clinical information from patient records. However, early literature has shown that without careful prompt design, LLMs are vulnerable to error. The primary purpose of this study is to determine whether an LLM platform, compared to traditional clinical chart reviewers, can reliably and automatically screen for complications directly from medical notes for patients who underwent total ankle arthroplasty (TAA). Methods: Following IRB approval, patient records were retrospectively identified from an institutional TAA registry with surgeries performed from 2015 to 2024. Patients were manually evaluated by the research team for intraoperative fracture (IOF), deep vein thrombosis (DVT), superficial wound infection (SWI), and deep wound infection (DWI). Patient records were then scrubbed of HIPAA identifiers, age, and gender, and input into an LLM for analysis. An automated script was developed that assessed each patient visit note for complications and recorded the result in a table format with "yes" or "no" indicating whether a complication occurred. Disagreements between reviewer and LLM complications were secondarily reviewed by a blinded investigator to produce a final, gold-standard data set. The sensitivity and specificity of both reviewer and LLM chart review were compared, with statistical difference between the groups determined using a McNemar test and similarity of decisions evaluated using an intraclass correlation coefficient (ICC). Results: A total of 1952 notes were reviewed for 310 TAA procedures. The final rate of IOF, DVT, SWI, and DWI was found to be 5.2%, 0.0%, 4.2%, and 0.6%, respectively. Chart reviewers had high agreement with the LLM in evaluating DVT (100% match, ICC 1.00) and DWI (99.7% match, ICC 0.88), but significant disagreement in the rates of IOF (97.1% match, ICC 0.77, p = 0.008) and SWI (91.9% match, ICC 0.49, p = 0.05). After secondary review by a blinded author, the LLM was found to have a higher sensitivity compared to reviewers for SWI (0.85 vs. 0.69, respectively) and IOF (1.00 vs. 0.50, respectively). However, reviewers had a higher specificity than the LLM for SWI (0.98 vs. 0.95, respectively). Conclusion: Our results support that current LLMs can be applied to screen free-text medical records for complications at a sensitivity and specificity comparable to clinical chart reviewers. The LLM is more prone to both true and false positives, as indicated by its higher sensitivity and lower specificity. Importantly, this assessment of complications using the LLM was completely automated and did not require human intervention to run, allowing substantially higher efficiency. LLMs show promise to dramatically scale the size and reliability of clinical outcome and quality registries in coming years. Further refinement of the LLM script may improve accuracy.
Table: Comparing identified complications following total ankle arthroplasty between manual chart review and a large language model. Rates of intraoperative fracture (IOF), deep vein thrombosis (DVT), superficial wound infection (SWI), and deep wound infection (DWI) identified by manual chart reviewers and a large language model (LLM). Sensitivity and specificity for each are provided relative to a verified gold standard established by a fellowship-trained, blinded investigator.
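The reviewer-versus-LLM comparison described above is a paired McNemar test over discordant yes/no calls, as in the sketch below; the 2x2 counts are invented, not the registry's data.

```python
# Paired-comparison sketch for one complication (illustrative; counts are
# invented). McNemar's test is driven by the discordant off-diagonal cells.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# rows = manual reviewer (yes, no), columns = LLM (yes, no)
table = np.array([[12, 2],
                  [9, 287]])
print(mcnemar(table, exact=True).pvalue)
```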