Implementation and development experience of an AI‐assisted rostering system in a Hong Kong emergency department
Abstract

Background: Manual emergency department (ED) rostering is labour-intensive and prone to inconsistency. We developed and implemented an artificial intelligence (AI)-assisted rostering system that combined large language model (LLM)-supported coding with a constraint solver. This study describes its development, implementation and lessons learnt from real-world use.

Methods: This implementation-science project involved a clinician-led team building a Python-based rostering programme using ChatGPT for code generation and Google OR-Tools for optimisation. Development followed iterative cycles of prototyping, testing and user feedback. The solver first generated a basic roster backbone of A (morning), P (afternoon), N (night) and O (off) duties under fixed, adjustable and soft constraints. A post-processing module then translated these into duty subtypes to improve coverage. Implementation outcomes included efficiency, roster quality and coverage, assessed by workload balance, reduction of unfavourable patterns, fairness metrics and staff feedback.

Results: The system produced feasible rosters across five consecutive monthly cycles and reduced drafting time by over 90%. Roster quality improved with more balanced coverage among ranks, fairer duty and off-day allocation and about 30% fewer unfavourable patterns. The model maintained consistent rest rules and equitable workload distribution. Early phases required constraint tuning and human verification, which decreased as the model stabilised. Informal feedback noted improved predictability, fairness and coverage stability.

Conclusion: An AI-assisted rostering system was successfully developed and deployed in a clinical setting through iterative human-AI collaboration. LLM-assisted programming enabled nonprogrammers to create adaptable operational tools. The modular backbone-post-processing design allows replication in other EDs with minimal modification.
- Research Article
- 10.1001/jamanetworkopen.2024.8895
- May 7, 2024
- JAMA Network Open
The introduction of large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4; OpenAI), has generated significant interest in health care, yet studies evaluating their performance in a clinical setting are lacking. Determination of clinical acuity, a measure of a patient's illness severity and level of required medical attention, is one of the foundational elements of medical reasoning in emergency medicine. The objective was to determine whether an LLM can accurately assess clinical acuity in the emergency department (ED). This cross-sectional study identified all adult ED visits from January 1, 2012, to January 17, 2023, at the University of California, San Francisco, with a documented Emergency Severity Index (ESI) acuity level (immediate, emergent, urgent, less urgent, or nonurgent) and with a corresponding ED physician note. A sample of 10 000 pairs of ED visits with nonequivalent ESI scores, balanced across the 10 possible pairs of 5 ESI scores, was selected at random. The main outcome was the ability of the LLM to classify acuity levels of ED patients based on the ESI across 10 000 patient pairs. Using deidentified clinical text, the LLM was queried to identify the patient with the higher-acuity presentation within each pair based on the patients' clinical history. An earlier LLM was queried to allow comparison with this model. Accuracy scores were calculated to evaluate the performance of both LLMs across the 10 000-pair sample. A 500-pair subsample was manually classified by a physician reviewer to compare performance between the LLMs and human classification. From a total of 251 401 adult ED visits, a balanced sample of 10 000 patient pairs was created wherein each pair comprised patients with disparate ESI acuity scores. Across this sample, the LLM correctly inferred the patient with higher acuity for 8940 of 10 000 pairs (accuracy, 0.89 [95% CI, 0.89-0.90]).
Performance of the comparator LLM (accuracy, 0.84 [95% CI, 0.83-0.84]) was below that of its successor. Among the 500-pair subsample that was also manually classified, LLM performance (accuracy, 0.88 [95% CI, 0.86-0.91]) was comparable with that of the physician reviewer (accuracy, 0.86 [95% CI, 0.83-0.89]). In this cross-sectional study of 10 000 pairs of ED visits, the LLM accurately identified the patient with higher acuity when given pairs of presenting histories extracted from patients' first ED documentation. These findings suggest that the integration of an LLM into ED workflows could enhance triage processes while maintaining triage quality and warrants further investigation.
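The headline figure above can be recovered from the raw counts (8940 of 10 000 pairs). The normal-approximation interval used in this sketch is an assumption, since the paper does not state which CI method it applied:

```python
# Recomputing pairwise accuracy with a normal-approximation 95% CI.
# The interval method is an assumption; the study does not specify one.
import math

def accuracy_with_ci(correct, total, z=1.96):
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, p - half, p + half

p, lo, hi = accuracy_with_ci(8940, 10000)
print(f"accuracy {p:.2f} (95% CI, {lo:.2f}-{hi:.2f})")  # accuracy 0.89 (95% CI, 0.89-0.90)
```

At two decimal places this reproduces the reported 0.89 (95% CI, 0.89-0.90).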
- Research Article
- 10.2196/72984
- Jul 31, 2025
- Journal of Medical Internet Research
Background: Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats even though they are routinely documented in physician notes. Large language models (LLMs), noted for their generalizability, could help bridge this gap by mimicking the role of human expert chart reviewers for symptom identification.
Objective: The primary objective of this multisite study was to measure the accurate identification of infectious respiratory disease symptoms using LLMs instructed to follow chart review guidelines. The secondary objective was to evaluate LLM generalizability in multisite settings without the need for site-specific training, fine-tuning, or customization.
Methods: Four LLMs were evaluated: GPT-4, GPT-3.5, Llama2 70B, and Mixtral 8×7B. LLM prompts were instructed to take on the role of chart reviewers and follow symptom annotation guidelines when assessing physician notes. Ground truth labels for each note were annotated by subject matter experts. Optimal LLM prompting strategies were selected using a development corpus of 103 notes from the emergency department at Boston Children’s Hospital. The performance of each LLM was measured using a test corpus with 202 notes from Boston Children’s Hospital. The performance of an International Classification of Diseases, Tenth Revision (ICD-10)–based method was also measured as a baseline. Generalizability of the most performant LLM was then measured in a validation corpus of 308 notes from 21 emergency departments in the Indiana Health Information Exchange.
Results: Symptom identification accuracy was superior for every LLM tested for each infectious disease symptom compared to an ICD-10–based method (F1-score=45.1%). GPT-4 was the highest scoring (F1-score=91.4%; P<.001) and was significantly better than the ICD-10–based method, followed by GPT-3.5 (F1-score=90.0%; P<.001), Llama2 (F1-score=81.7%; P<.001), and Mixtral (F1-score=83.5%; P<.001). For the validation corpus, performance of the ICD-10–based method decreased (F1-score=26.9%), while GPT-4 increased (F1-score=94.0%), demonstrating better generalizability using GPT-4 (P<.001).
Conclusions: LLMs significantly outperformed an ICD-10–based method for respiratory symptom identification in emergency department electronic health records. GPT-4 demonstrated the highest accuracy and generalizability, suggesting that LLMs may augment or replace traditional approaches. LLMs can be instructed to mimic human chart reviewers with high accuracy. Future work should assess broader symptom types and health care settings.
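The F1 comparison above can be illustrated with a minimal sketch of binary F1 over per-note symptom labels. The toy gold and predicted labels below are invented for illustration, not the study's data:

```python
# Minimal F1 evaluation: per-note binary symptom labels from a model
# vs. expert chart-review annotation. Labels are toy data.
def f1_score(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [1, 1, 0, 0, 1, 0, 1, 1]  # expert labels (toy data)
pred = [1, 0, 0, 1, 1, 0, 1, 1]  # LLM "chart reviewer" output (toy data)
print(round(f1_score(gold, pred), 3))  # 0.8
```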
- Conference Article
- 10.54941/ahfe1006669
- Jan 1, 2025
Thematic Analysis (TA) is a powerful tool for human factors, HCI, and UX researchers to gather system usability insights from qualitative data like open-ended survey questions. However, TA is both time consuming and difficult, requiring researchers to review and compare hundreds, thousands, or even millions of pieces of text. Recently, this has driven many to explore using Large Language Models (LLMs) to support such an analysis. However, LLMs have their own processing limitations and usability challenges when implementing them reliably as part of a research process – especially when working with a large corpus of data that exceeds LLM context windows. These challenges are compounded when using locally hosted LLMs, which may be necessary to analyze sensitive and/or proprietary data. However, little human factors research has rigorously examined how various prompt engineering techniques can augment an LLM to overcome these limitations and improve usability. Accordingly, in the present paper, we investigate the impact of several prompt engineering techniques on the quality of LLM-mediated TA. Using a local LLM (Llama 3.1 8b) to ensure data privacy, we developed four LLM variants with progressively complex prompt engineering techniques and used them to extract themes from user feedback regarding the usability of a novel knowledge management system prototype. 
The LLM variants were as follows:
1. A “baseline” variant with no prompt engineering or scalability
2. A “naïve batch processing” variant that sequentially analyzed small batches of the user feedback to generate a single list of themes
3. An “advanced batch processing” variant that built upon the naïve variant by adding prompt engineering techniques (e.g., chain-of-thought prompting)
4. A “cognition-inspired” variant that incorporated advanced prompt engineering techniques and kept a working memory-like log of themes and their frequency

Contrary to conventional approaches to studying LLMs, which largely rely upon descriptive statistics (e.g., % improvement), we systematically applied a set of evaluation methods from behavioral science and human factors. We performed three stages of evaluation of the outputs of each LLM variant: we compared the LLM outputs to our team’s original TA, we had human factors professionals (N = 4) rate the quality and usefulness of the outputs, and we compared the Inter-Rater Reliability (IRR) of other human factors professionals (N = 2) attempting to code the original data with the outputs generated by each variant. Results demonstrate that even small, locally deployed LLMs can produce high-quality TA when guided by appropriate prompts. While the “baseline” variant performed surprisingly well for small datasets, we found that the other, scalable methods were dependent upon advanced prompt engineering techniques to be successful. Only our novel "cognition-inspired" approach performed as well as the “baseline” variant in qualitative and quantitative comparisons of ratings and coding IRR. This research provides practical guidance for human factors researchers looking to integrate LLMs into their qualitative analysis workflows, disentangling and uncovering the importance of context window limitations, batch processing strategies, and advanced prompt engineering techniques.
The findings suggest that local LLMs can serve as valuable and scalable tools in thematic analysis.
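The batch-processing idea above (feed fixed-size batches of feedback to a local LLM and accumulate a running theme log) can be sketched as follows. `ask_llm` is a hypothetical stand-in for a call to a local model such as Llama 3.1 8b; here it is stubbed with keyword matching purely so the control flow is runnable:

```python
# Sketch of naïve batch processing with a working-memory-like theme log.
# `ask_llm` is a stub; a real system would prompt a local LLM per batch.
from collections import Counter

def ask_llm(batch):
    """Hypothetical LLM call: return themes mentioned in a batch of comments."""
    keywords = {"slow": "performance", "confusing": "navigation", "crash": "stability"}
    return [theme for text in batch for kw, theme in keywords.items() if kw in text]

def batched_thematic_analysis(comments, batch_size=2):
    theme_log = Counter()  # running log of themes and their frequency
    for i in range(0, len(comments), batch_size):
        theme_log.update(ask_llm(comments[i:i + batch_size]))
    return theme_log

feedback = ["search is slow", "menu is confusing", "app crash on save", "slow load times"]
print(batched_thematic_analysis(feedback))
# Counter({'performance': 2, 'navigation': 1, 'stability': 1})
```

Batching keeps each prompt inside the model's context window; the "cognition-inspired" variant extends this by feeding the accumulated log back into subsequent prompts.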
- Research Article
- 10.2196/69504
- Apr 11, 2025
- JMIR aging
Polypharmacy, the concurrent use of multiple medications, is prevalent among older adults and associated with increased risks for adverse drug events including falls. Deprescribing, the systematic process of discontinuing potentially inappropriate medications, aims to mitigate these risks. However, the practical application of deprescribing criteria in emergency settings remains limited due to time constraints and criteria complexity. This study aims to evaluate the performance of a large language model (LLM)-based pipeline in identifying deprescribing opportunities for older emergency department (ED) patients with polypharmacy, using 3 different sets of criteria: Beers, Screening Tool of Older People's Prescriptions, and Geriatric Emergency Medication Safety Recommendations. The study further evaluates LLM confidence calibration and its ability to improve recommendation performance. We conducted a retrospective cohort study of older adults presenting to an ED in a large academic medical center in the Northeast United States from January 2022 to March 2022. A random sample of 100 patients (712 total oral medications) was selected for detailed analysis. The LLM pipeline consisted of two steps: (1) filtering high-yield deprescribing criteria based on patients' medication lists, and (2) applying these criteria using both structured and unstructured patient data to recommend deprescribing. Model performance was assessed by comparing model recommendations to those of trained medical students, with discrepancies adjudicated by board-certified ED physicians. Selective prediction, a method that allows a model to abstain from low-confidence predictions to improve overall reliability, was applied to assess the model's confidence and decision-making thresholds. 
The LLM was significantly more effective in identifying deprescribing criteria (positive predictive value: 0.83; negative predictive value: 0.93; McNemar test for paired proportions: χ²₁=5.985; P=.02) relative to medical students, but showed limitations in making specific deprescribing recommendations (positive predictive value=0.47; negative predictive value=0.93). Adjudication revealed that while the model excelled at identifying when there was a deprescribing criterion related to one of the patient's medications, it often struggled with determining whether that criterion applied to the specific case due to complex inclusion and exclusion criteria (54.5% of errors) and ambiguous clinical contexts (eg, missing information; 39.3% of errors). Selective prediction only marginally improved LLM performance due to poorly calibrated confidence estimates. This study highlights the potential of LLMs to support deprescribing decisions in the ED by effectively filtering relevant criteria. However, challenges remain in applying these criteria to complex clinical scenarios, as the LLM demonstrated poor performance on more intricate decision-making tasks, with its reported confidence often failing to align with its actual success in these cases. The findings underscore the need for clearer deprescribing guidelines, improved LLM calibration for real-world use, and better integration of human-artificial intelligence workflows to balance artificial intelligence recommendations with clinician judgment.
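Selective prediction, as used above, lets a model abstain when its self-reported confidence falls below a threshold, trading coverage for reliability. The (recommendation, confidence, truth) triples below are invented illustrative data, not the study's results:

```python
# Sketch of selective prediction: abstain below a confidence threshold
# and report accuracy on answered cases plus coverage. Toy data only.
def selective_accuracy(predictions, threshold):
    answered = [(pred, truth) for pred, conf, truth in predictions if conf >= threshold]
    if not answered:
        return None, 0.0  # model abstained on everything
    accuracy = sum(pred == truth for pred, truth in answered) / len(answered)
    coverage = len(answered) / len(predictions)
    return accuracy, coverage

preds = [("deprescribe", 0.9, "deprescribe"), ("keep", 0.4, "deprescribe"),
         ("deprescribe", 0.8, "deprescribe"), ("keep", 0.3, "keep"),
         ("deprescribe", 0.7, "keep")]
print(selective_accuracy(preds, 0.0))   # (0.6, 1.0) -- answer everything
print(selective_accuracy(preds, 0.75))  # (1.0, 0.4) -- abstain on low confidence
```

The toy data shows the ideal case where confidence tracks correctness; the study found the opposite, which is why abstention helped only marginally there.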
- Research Article
- 10.1001/jamanetworkopen.2025.38427
- Oct 21, 2025
- JAMA Network Open
Emergency department (ED) discharge documentation is time-consuming and often incomplete. To develop a large language model (LLM) assistant that generates ED discharge notes and to evaluate its effectiveness on documentation quality and workflow efficiency. This comparative effectiveness study, which was conducted at a 2400-bed tertiary care hospital in South Korea, consisted of 2 primary phases: a development phase and sequential validation of the LLM assistant. In the randomized sequential prospective validation, 6 emergency physicians first wrote discharge notes manually (session 1), then edited LLM-generated drafts after a 1-hour washout period (session 2). Three independent physicians evaluated 300 note sets (each containing a manual note, an LLM draft, and an LLM-assisted note). For model development and validation, patient records from ED visits between September 1, 2022, and August 31, 2023, were used. The inclusion criteria encompassed adult patients (aged ≥17 years) and pediatric patients with nondisease conditions (eg, trauma, poisoning, or burns). Emergency physicians selected 592 representative cases for training and 50 for validation. A commercially available text generation transformer model was used as the core LLM, fine-tuned on the 592 training cases. Two distinct processing pipelines were implemented within the LLM assistant to accommodate different input data: (1) for patients managed solely by emergency physicians, using the ED initial record and prescription list, and (2) for those requiring specialty consultations, using the ED initial record and consultation request form. The main outcomes were the quality of notes, assessed using 4C metrics (completeness, correctness, conciseness, and clinical utility) on a Likert scale ranging from 1 to 5, and the time taken to complete the notes manually and with the LLM assistant. Of the 50 test cases, the mean (SD) patient age was 57.7 (23.1) years, and 28 patients (56%) were female.
LLM-assisted notes achieved higher scores than manual notes in completeness (4.23 [95% CI, 4.17-4.28] vs 4.03 [95% CI, 3.96-4.09]), correctness (4.38 [95% CI, 4.33-4.42] vs 4.20 [95% CI, 4.14-4.26]), conciseness (4.23 [95% CI, 4.18-4.28] vs 4.11 [95% CI, 4.05-4.17]), and clinical utility (4.17 [95% CI, 4.11-4.23] vs 3.85 [95% CI, 3.78-3.91]) (all P < .001). When compared with LLM drafts, LLM-assisted notes excelled in conciseness (4.23 vs 3.98 [95% CI, 3.91-4.04]; P < .001) and maintained equivalent clinical utility (4.17 vs 4.16 [95% CI, 4.11-4.21]; P > .99), but scored lower in completeness (4.23 vs 4.34 [95% CI, 4.29-4.39]; P = .001) and correctness (4.38 vs 4.45 [95% CI, 4.41-4.49]; P < .001). The median documentation time per note dropped from 69.5 (95% CI, 65.5-78.0) seconds for manual notes to 32.0 (95% CI, 29.5-36.0) seconds for LLM-assisted notes (P < .001). In this comparative effectiveness study, use of an on-site LLM assistant was associated with reduced writing time for ED discharge notes compared with manual note-taking, without compromising documentation quality, representing a critical advancement in the use of artificial intelligence for clinical practice.
- Research Article
- 10.1093/ndt/gfae069.657
- May 23, 2024
- Nephrology Dialysis Transplantation
Background and Aims: The rapidly growing scientific literature poses a significant challenge for researchers seeking to distill key insights. We utilized Retrieval-Augmented Generation (RAG), a novel AI-driven approach, to efficiently process and extract meaningful information from published literature on uremic toxins. RAG is a general AI framework for improving the quality of responses generated by Large Language Models (LLMs) by supplementing the LLM's internal representation of information with curated expert knowledge.
Method: First, we collected all PubMed abstracts related to the topic of “uremic toxins” through Metapub, a Python library designed to facilitate fetching metadata from PubMed. Second, we set up a RAG system comprising two steps. In the retrieval step, the questions on the topic (“uremic toxins”) and the documents (all collected abstracts and manuscripts) are encoded into vectors (i.e., high-dimensional numerical representations), and similarity measures are used to find the best matches between documents and questions. In the augmented generation step, the LLM (e.g., ChatGPT) uses these best-matching documents to generate a coherent and informed response.
Results: We collected 3497 abstracts from PubMed and 191 expert-curated publications in PDF format related to the topic “uremic toxins”. These 191 publications were broken down into 5756 documents, each with a manageable size of text. The final vector database comprised 9253 vectors. Using RAG, we requested responses from the LLM on multiple questions related to “uremic toxins”. Some examples are shown in Table 1. The first and second responses given by the LLM are reasonable. However, the third answer shows the phenomenon of ‘hallucination’, where models generate plausible and convincing-sounding yet factually incorrect information.
Conclusion: The use of RAG improves the capability of LLMs to answer questions by leveraging the information contained within curated abstracts and publications. Despite the improvements with RAG, the phenomenon of ‘hallucination’ persists. A concerning feature of hallucinations is their eloquent and convincing language. For the time being, LLM output, even when improved with RAG, requires scrutiny and human verification.
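The two-step RAG pipeline described above can be sketched end to end. The bag-of-words "encoder" and the prompt assembly below stand in for a real embedding model and LLM call, which are assumptions here; the sample documents are invented:

```python
# Sketch of RAG: (1) encode documents and a question as vectors and retrieve
# the closest matches by cosine similarity; (2) hand the matches to an LLM
# as context. The encoder is a toy bag-of-words stand-in for an embedding model.
import math
import re
from collections import Counter

def encode(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))  # toy "embedding"

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, documents, k=2):
    q = encode(question)
    ranked = sorted(documents, key=lambda d: cosine(q, encode(d)), reverse=True)
    return ranked[:k]

docs = [
    "Indoxyl sulfate is a protein-bound uremic toxin.",
    "Dialysis clears small water-soluble uremic toxins.",
    "Hypertension management in chronic kidney disease.",
]
context = retrieve("Which uremic toxins are protein bound?", docs)
# Step 2 (stubbed): the prompt an LLM would receive for augmented generation.
prompt = "Answer using only this context:\n" + "\n".join(context)
print(context[0])  # the indoxyl sulfate document ranks first
```

Grounding the generation step in retrieved text reduces, but as the abstract notes does not eliminate, hallucination.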
- Research Article
- 10.1007/s11548-025-03475-1
- Jul 16, 2025
- International journal of computer assisted radiology and surgery
Large language models (LLMs) have significant potential in healthcare due to their ability to process unstructured text from electronic health records (EHRs) and to generate knowledge with few or no training examples. In this study, we investigate the effectiveness of LLMs for clinical decision support, specifically in the context of emergency department triage, where the volume of textual data is minimal compared to other scenarios such as making a clinical diagnosis. We benchmark LLMs against traditional machine learning (ML) approaches using the Emergency Severity Index (ESI) as the gold-standard triage criterion. The benchmark includes general-purpose, specialised, and fine-tuned LLMs. All models are prompted to predict the ESI score from an EHR. We use a balanced subset (n = 1000) from MIMIC-IV-ED, a large database containing records of admissions to the emergency department of Beth Israel Deaconess Medical Center. Our findings show that the best-performing models have an average F1-score below 0.60. Also, while zero-shot and fine-tuned LLMs can outperform standard ML models, their performance is surpassed by ML models augmented with features derived from LLMs or knowledge graphs. LLMs show value for clinical decision support in scenarios with limited textual data, such as emergency department triage. The study advocates for integrating LLM knowledge representation to improve existing ML models rather than using LLMs in isolation, suggesting this as a more promising approach to enhance the accuracy of automated triage systems.
- Research Article
- 10.2196/65454
- Jan 21, 2025
- JMIR Medical Informatics
Background: Prediction models have demonstrated a range of applications across medicine, including using electronic health record (EHR) data to identify hospital readmission and mortality risk. Large language models (LLMs) can transform unstructured EHR text into structured features, which can then be integrated into statistical prediction models, ensuring that the results are both clinically meaningful and interpretable.
Objective: This study aims to compare the classification decisions made by clinical experts with those generated by a state-of-the-art LLM, using terms extracted from a large EHR data set of individuals with mental health disorders seen in emergency departments (EDs).
Methods: Using a dataset from the EHR systems of more than 50 health care provider organizations in the United States from 2016 to 2021, we extracted all clinical terms that appeared in at least 1000 records of individuals admitted to the ED for a mental health–related problem from a source population of over 6 million ED episodes. Two experienced mental health clinicians (one medically trained psychiatrist and one clinical psychologist) reached consensus on the classification of EHR terms and diagnostic codes into categories. We evaluated an LLM’s agreement with clinical judgment across three classification tasks: (1) classify terms into “mental health” or “physical health”; (2) classify mental health terms into 1 of 42 prespecified categories; and (3) classify physical health terms into 1 of 19 prespecified broad categories.
Results: There was high agreement between the LLM and clinical experts when categorizing 4553 terms as “mental health” or “physical health” (κ=0.77, 95% CI 0.75-0.80). However, there was still considerable variability in LLM-clinician agreement on the classification of mental health terms (κ=0.62, 95% CI 0.59-0.66) and physical health terms (κ=0.69, 95% CI 0.67-0.70).
Conclusions: The LLM displayed high agreement with clinical experts when classifying EHR terms into certain mental health or physical health term categories. However, agreement with clinical experts varied considerably within both sets of mental and physical health term categories. Importantly, the use of LLMs presents an alternative to manual human coding, with great potential to create interpretable features for prediction models.
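The agreement metric reported above, Cohen's kappa, corrects observed agreement for agreement expected by chance. A minimal sketch for the binary "mental health" vs "physical health" task, on invented toy labels rather than the study's 4553 terms:

```python
# Cohen's kappa between two raters (e.g. clinicians vs an LLM).
# The label lists are toy illustrative data.
def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected)

clinician = ["mental", "mental", "physical", "physical", "mental", "physical"]
llm       = ["mental", "physical", "physical", "physical", "mental", "physical"]
print(round(cohens_kappa(clinician, llm), 2))  # 0.67
```

Here 5 of 6 labels agree (observed 0.83) but half that agreement would be expected by chance (expected 0.50), giving κ ≈ 0.67, in the same "substantial agreement" band as the study's binary task.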
- Research Article
- 10.1038/s41598-025-07649-4
- Jul 14, 2025
- Scientific Reports
Identifying patients with critical illness in emergency departments (EDs) is an ongoing challenge, partly due to the limited information available at the time of admission. The clinical notes in patient records have already received attention for their value in improving prediction, and recent large language models (LLMs) have demonstrated promising performance. However, the utilization of LLMs for analyzing clinical notes has not been extensively investigated. To improve the severity assessment of illness and the prediction of triage level, we developed a pipeline for utilizing LLMs (e.g., ChatGLM-2, GLM-4 and Alpaca-2) to extract information from the complaint and anamnesis in clinical notes. In this pipeline, an LLM is supplied with text input comprising a patient's complaint and anamnesis, where the input is further constructed with a prompt template, in-context learning (ICL), and retrieval-augmented generation (RAG). A severity score is then extracted from the LLM and integrated into a predictive model to improve its performance. We demonstrated the effectiveness of our pipeline on patient records derived from the Chinese Emergency Triage, Assessment, and Treatment (CETAT) database, incorporating the extracted score into a logistic regression as a predictor. At the early stage, when vital signs were typically not yet measured, the predictive value of the patient complaint and anamnesis was evident (an improvement in AUC-ROC from 0.746 to 0.802). At the later stage, once vital signs became available, the enhancement in prediction attributable to the score was weaker but still statistically significant in most cases. Recent LLMs are thus capable of extracting valuable information from clinical notes for identifying critical illness, although more efficient LLM-based methods are still needed to achieve better performance.
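The integration step described above, an LLM-derived severity score used as a predictor in a logistic regression, can be sketched as follows. The LLM call is stubbed (`llm_severity_score` is a hypothetical stand-in), and the tiny dataset is invented; a real pipeline would fit on CETAT-style records:

```python
# Sketch: an LLM-derived severity score (stubbed) feeds a single-feature
# logistic regression trained by plain SGD. All data is toy/invented.
import math

def llm_severity_score(note):
    """Hypothetical stand-in for prompting an LLM to rate severity 0-10."""
    cues = {"chest pain": 8, "unconscious": 9, "cough": 2, "sprain": 1}
    return max((s for cue, s in cues.items() if cue in note), default=3)

def fit_logistic(x, y, lr=0.1, epochs=2000):
    w, b = 0.0, 0.0  # P(critical) = sigmoid(w*score + b)
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(w * xi + b)))
            w -= lr * (p - yi) * xi
            b -= lr * (p - yi)
    return w, b

notes = ["severe chest pain", "patient unconscious", "mild cough", "ankle sprain"]
critical = [1, 1, 0, 0]
scores = [llm_severity_score(n) for n in notes]
w, b = fit_logistic(scores, critical)
predict = lambda note: 1 / (1 + math.exp(-(w * llm_severity_score(note) + b)))
print(predict("sudden chest pain") > 0.5)  # True: high LLM score -> critical
```

In the study this score is one predictor alongside vital signs once those become available, which is why its marginal contribution shrinks at the later stage.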
- Conference Article
- 10.1145/3510003.3510203
- May 21, 2022
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about the quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code for the Python Pandas API from multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of such systems.
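The post-processing idea behind Jigsaw can be illustrated with a much simpler filter: candidate LLM completions are checked syntactically (via the `ast` module) and against an input-output example before being surfaced. The candidate snippets are invented, and Jigsaw itself uses far richer program analysis and synthesis than this sketch:

```python
# Toy post-processing filter for LLM-generated code: reject candidates
# that fail to parse or fail a user-supplied I/O example. Candidates are
# invented; real systems like Jigsaw apply deeper program analysis.
import ast

def passes_checks(code, test_input, expected):
    try:
        ast.parse(code)  # syntactic validity check
    except SyntaxError:
        return False
    scope = {}
    try:
        exec(code, scope)                          # define candidate function f
        return scope["f"](test_input) == expected  # I/O example check
    except Exception:
        return False

candidates = [
    "def f(xs): return sorted(xs",   # syntax error: rejected at parse time
    "def f(xs): return xs",          # parses but fails the I/O example
    "def f(xs): return sorted(xs)",  # passes both checks
]
survivors = [c for c in candidates if passes_checks(c, [3, 1, 2], [1, 2, 3])]
print(len(survivors))  # 1
```

Even this crude filter catches the failure mode the paper highlights: a fluent model with no semantic guarantees emitting code that does not run or does not do what was asked.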
- Research Article
- 10.1101/2024.04.03.24305088
- Apr 4, 2024
- medRxiv
Importance: Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed.
Objective: To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and evaluate the prevalence and type of errors across each section of the discharge summary.
Design: Cross-sectional study.
Setting: University of California, San Francisco ED.
Participants: We identified all adult ED visits from 2012 to 2023 with an ED clinician note. We randomly selected a sample of 100 ED visits for GPT-summarization.
Exposure: We investigate the potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary.
Main Outcomes and Measures: GPT-3.5-turbo- and GPT-4-generated discharge summaries were evaluated by two independent Emergency Medicine physician reviewers across three evaluation criteria: (1) inaccuracy of GPT-summarized information; (2) hallucination of information; (3) omission of relevant clinical information. On identifying each error, reviewers were additionally asked to provide a brief explanation of their reasoning, which was manually classified into subgroups of errors.
Results: From 202,059 eligible ED visits, we randomly sampled 100 for GPT-generated summarization and expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases; however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of GPT-generated summaries, while clinical omissions were concentrated in text describing patients’ Physical Examination findings or History of Presenting Complaint.
Conclusions and Relevance: In this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors found in GPT-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.
- Research Article
- 10.1097/hc9.0000000000000638
- Mar 1, 2025
- Hepatology communications
Hepatic steatosis is a precursor to more severe liver disease, increasing morbidity and mortality risks. In the Emergency Department, routine abdominal imaging often reveals incidental hepatic steatosis that goes undiagnosed due to the acute nature of encounters. Imaging reports in the electronic health record contain valuable information not easily accessible as discrete data elements. We hypothesized that large language models could reliably detect hepatic steatosis from reports without extensive natural language processing training. We identified 200 adults who had CT abdominal imaging in the Emergency Department between August 1, 2016, and December 31, 2023. Using text from imaging reports and structured prompts, 3 Azure OpenAI models (ChatGPT 3.5, 4, 4o) identified patients with hepatic steatosis. We evaluated model performance regarding accuracy, inter-rater reliability, sensitivity, and specificity compared to physician reviews. The accuracy for the models was 96.2% for v3.5, 98.3% for v4, and 98.8% for v4o. Inter-rater reliability ranged from 0.99 to 1.00 across 10 iterations. Mean model confidence scores were 2.9 (SD 0.8) for v3.5, 3.9 (SD 0.3) for v4, and 4.0 (SD 0.07) for v4o. Incorrect evaluations were 76 (3.8%) for v3.5, 34 (1.7%) for v4, and 25 (1.3%) for v4o. All models showed sensitivity and specificity above 0.9. Large language models can assist in identifying incidental conditions from imaging reports that would otherwise be missed opportunities for early disease intervention. They democratize natural language processing by allowing user-friendly, expansive analysis of electronic medical records without requiring the development of complex natural language processing models.
- Research Article
- 10.1038/s41467-024-52415-1
- Oct 8, 2024
- Nature Communications
The release of GPT-4 and other large language models (LLMs) has the potential to transform healthcare. However, existing research evaluating LLM performance on real-world clinical notes is limited. Here, we conduct a highly powered study to determine whether LLMs can provide clinical recommendations for three tasks (admission status, radiological investigation(s) request status, and antibiotic prescription status) using clinical notes from the Emergency Department. We randomly selected 10,000 Emergency Department visits to evaluate the accuracy of zero-shot, GPT-3.5-turbo- and GPT-4-turbo-generated clinical recommendations across four different prompting strategies. We found that both GPT-4-turbo and GPT-3.5-turbo performed poorly compared to a resident physician, with accuracy scores on average 8% and 24% lower, respectively, than the physician's. Both LLMs tended to be overly cautious in their recommendations, with high sensitivity at the cost of specificity. Our findings demonstrate that, while early evaluations of the clinical use of LLMs are promising, LLM performance must be significantly improved before their deployment as decision support systems for clinical recommendations and other complex tasks.
- Supplementary Content
- 10.1093/eurpub/ckaf161.1048
- Oct 1, 2025
- The European Journal of Public Health
Background: Efficient clinical documentation in emergency departments (EDs) is vital for timely, equitable care, especially for children with complex medical needs. Yet, documentation remains a resource-intensive bottleneck. While Large Language Models (LLMs) offer potential to automate this task, their utility in non-English healthcare systems is largely unexplored. This study evaluates the effectiveness of an LLM in structuring free-text data from pediatric ED records in Italy, highlighting implications for health system responsiveness and quality of care.
Methods: We conducted a retrospective study at the pediatric ED of Padova University Hospital, analyzing anonymized free-text admission records (2007-2023) from children with medical complexities. A manually labeled subset (n = 697) served as a gold standard, compared against LLM-extracted data, operating via GDPR-compliant, prompt-based interaction. We assessed extraction accuracy for key variables (e.g., triage codes, outcomes, referrals) and measured time efficiency gains.
Results: The LLM reduced data extraction time from 5 minutes to 6 seconds per record. Accuracy was high for triage color codes (99.3%, 95% CI: 98.3-99.8%) and ED outcomes (98.6%, 95% CI: 97.4-99.3%). Performance was robust for procedural classifications but lower for medications (76.8%) and specialist consultations (72.2%), reflecting ambiguity in narrative notes. The model successfully adapted to non-English clinical texts, supporting its generalizability in multilingual health systems.
Conclusions: LLMs can streamline ED documentation while maintaining high data fidelity, easing clinician workload, and enabling faster analytics for public health surveillance. This approach shows promise in enhancing pediatric emergency care quality and can scale to support digital health transformation across diverse European settings. Challenges remain in interpreting ambiguous inputs, underscoring the need for further model refinement and real-world validation.
Key messages:
• LLMs substantially accelerate clinical data extraction in pediatric emergency care with high accuracy.
• This AI-based solution supports scalable, multilingual digital health infrastructure for timely public health action.
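The accuracy confidence intervals quoted in this abstract (e.g. 99.3%, 95% CI 98.3–99.8% on n = 697) are binomial proportion intervals. The paper does not state which interval method it used; a Wilson score interval, one common choice, can be computed as:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# e.g. 692 correct extractions out of 697 records (~99.3% accuracy)
lo, hi = wilson_ci(692, 697)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] even for proportions near 1, which matters at accuracies like 99.3%.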
- Research Article
- 10.64898/2025.12.17.25342510
- Dec 19, 2025
- medRxiv
Introduction: Opioid use disorder (OUD) is common in emergency departments (EDs); identification via structured computable phenotypes may miss important clinical context.
Objective: Compare a computable structured OUD phenotype with a zero-shot large language model (LLM) using expert review as the reference.
Methods: We retrospectively analyzed 202 adult ED encounters. Two emergency physicians independently determined OUD status with consensus adjudication. The phenotype used ICD-10 codes, medications, toxicology, consult notes, and keyword rules. The LLM (GPT-4.1) classified OUD from concatenated ED notes. Test characteristics and McNemar’s tests were computed.
Results: Experts classified 56 (28%) encounters as OUD (κ=0.77). The phenotype showed sensitivity 0.98 (95% CI 0.93–1.00) and specificity 0.54 (0.44–0.64). The LLM showed sensitivity 0.93 (0.87–0.96) and specificity 0.90 (0.77–0.97). Sensitivity (p=0.0117) and specificity (p<0.001) differed significantly.
Conclusion: A zero-shot LLM achieved balanced performance, outperforming the structured phenotype on specificity while maintaining high sensitivity, supporting tiered ED screening for OUD.
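McNemar's test, used in this abstract to compare the phenotype and the LLM on the same encounters, depends only on the discordant pairs (encounters the two classifiers disagree on). A minimal exact version, with illustrative counts rather than the study's data, could look like:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts:
    b = encounters only classifier 1 got right,
    c = encounters only classifier 2 got right."""
    n = b + c
    k = min(b, c)
    # two-sided binomial tail at p = 0.5: 2 * P(X <= k), capped at 1
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** (n - 1)
    return min(1.0, p)

# e.g. 2 vs 12 discordant encounters between two classifiers
p_value = mcnemar_exact(2, 12)
```

The exact binomial form is preferred over the chi-squared approximation when discordant counts are small, as is plausible with 202 encounters.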