Abstract
Recent advancements in large language models (LLMs) such as generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine whether the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance LLM performance, including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate potential disparities introduced by LLMs with respect to race and gender. A clinical expert manually assessed the explanations generated by GPT-4 for its predictions to determine whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841) and GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows LLMs to make better decisions. The bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity.
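To make the prompting setup concrete, the sketch below illustrates one way the prompt augmentation described above could be implemented: an ED report is classified with a few-shot chat prompt that is optionally prepended with demographic information and prior disease history. This is a minimal illustration using the OpenAI Python SDK; the model identifier, exemplars, prompt wording, and helper functions are assumptions for demonstration and are not taken from the study.

```python
# Minimal sketch of few-shot prompting with demographic/history augmentation
# for classifying ED reports. All prompt text, exemplars, and the model name
# are illustrative assumptions, not the authors' actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = [
    ("Patient presents with acute left flank pain radiating to the groin; "
     "CT shows a 4 mm obstructing stone.", "yes"),
    ("Patient presents with productive cough and fever; chest X-ray shows "
     "right lower lobe consolidation.", "no"),
]

def build_messages(report: str, demographics: str | None = None,
                   history: str | None = None) -> list[dict]:
    """Assemble a few-shot chat prompt, optionally augmented with
    demographics and prior disease history."""
    messages = [{
        "role": "system",
        "content": ("You are a clinical NLP assistant. Answer 'yes' if the "
                    "emergency department visit was due to symptomatic kidney "
                    "stones, otherwise answer 'no'."),
    }]
    # Few-shot exemplars shown as prior user/assistant turns.
    for text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": f"ED report: {text}"})
        messages.append({"role": "assistant", "content": label})
    # Prompt augmentation: prepend demographics and prior disease history.
    context = ""
    if demographics:
        context += f"Demographics: {demographics}\n"
    if history:
        context += f"Prior disease history: {history}\n"
    messages.append({"role": "user", "content": f"{context}ED report: {report}"})
    return messages

def classify_report(report: str, demographics: str | None = None,
                    history: str | None = None) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        temperature=0,
        messages=build_messages(report, demographics, history),
    )
    return response.choices[0].message.content.strip().lower()
```

In a zero-shot variant, the exemplar turns would simply be omitted; the same message-building function could also be reused to construct fine-tuning examples.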