Abstract
Generative large language models (LLMs) represent a significant advance in natural language processing, achieving state-of-the-art performance across a variety of tasks. However, their application in clinical settings using real electronic health records (EHRs) remains rare and presents numerous challenges. This study systematically reviews the use of generative LLMs, and the effectiveness of relevant techniques, in patient care-related topics involving EHRs; summarizes the challenges encountered; and suggests future directions. A Boolean search for peer-reviewed articles was conducted on May 19, 2024 in PubMed and Web of Science, covering research articles published since 2023, approximately one month after the release of ChatGPT. Search results were deduplicated. Multiple reviewers, including biomedical informaticians, computer scientists, and a physician, screened the publications for eligibility and extracted the data. Only studies that applied generative LLMs to real EHR data were included. We summarized the use of prompt engineering, fine-tuning, multimodal EHR data, and evaluation metrics. We also identified the challenges of applying LLMs in clinical settings reported by the included studies and proposed future directions. The initial search identified 6,328 unique studies, of which 76 were included after eligibility screening. Of these, 67 studies (88.2%) employed zero-shot prompting; five of them reported 100% accuracy on five specific clinical tasks. Nine studies used advanced prompting strategies; four tested these strategies experimentally and found that prompt engineering improved performance, with one study noting a non-linear relationship between the number of examples in a prompt and performance improvement. Eight studies explored fine-tuning generative LLMs: all reported performance improvements on specific tasks, but three noted potential performance degradation after fine-tuning on certain tasks. Only two studies utilized multimodal data, which improved LLM-based decision-making and enabled accurate rare disease diagnosis and prognosis. The included studies employed 55 different evaluation metrics for 22 purposes, such as correctness, completeness, and conciseness. Two studies investigated LLM bias: one detected no bias, while the other found that male patients received more appropriate clinical decision-making suggestions. Six studies identified hallucinations, such as fabricated patient names in structured thyroid ultrasound reports. Additional challenges included the impersonal tone of LLM consultations, which made patients uncomfortable, and patients' difficulty understanding LLM responses. Our review indicates that few studies have employed advanced computational techniques to enhance LLM performance. The diversity of evaluation metrics highlights the need for standardization. LLMs currently cannot replace physicians, owing to challenges such as bias, hallucination, and impersonal responses.
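To make the prompting strategies discussed above concrete, the sketch below contrasts zero-shot prompting (used by 88.2% of the included studies) with few-shot prompting on a synthetic clinical note. This is an illustrative example, not code from any reviewed study: the note and labels are fabricated, and call_llm is a hypothetical placeholder for whichever chat-completion API a reader actually uses.

```python
# Illustrative sketch (not from any reviewed study): zero-shot vs. few-shot
# prompting for extracting one structured field from a synthetic,
# de-identified clinical note. `call_llm` is a hypothetical placeholder --
# wire it to your provider's chat-completion client.
from typing import List, Tuple

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM chat-completion call."""
    raise NotImplementedError("Replace with your provider's client call.")

def zero_shot_prompt(note: str) -> str:
    # Zero-shot: the task is described, but no worked examples are given.
    return (
        "Extract the patient's smoking status "
        "(current/former/never/unknown) from the note below. "
        "Answer with one word.\n\n"
        f"Note: {note}\nAnswer:"
    )

def few_shot_prompt(note: str, examples: List[Tuple[str, str]]) -> str:
    # Few-shot: prepend labeled examples. One reviewed study found the gain
    # from adding examples is non-linear, so the example count is worth tuning.
    shots = "\n\n".join(f"Note: {n}\nAnswer: {a}" for n, a in examples)
    return (
        "Extract the patient's smoking status "
        "(current/former/never/unknown). Answer with one word.\n\n"
        f"{shots}\n\nNote: {note}\nAnswer:"
    )

if __name__ == "__main__":
    note = "62M with COPD; quit smoking in 2019 after a 30 pack-year history."
    examples = [("Denies any tobacco use.", "never"),
                ("Smokes one pack per day.", "current")]
    print(zero_shot_prompt(note))
    print(few_shot_prompt(note, examples))
```

The script only assembles and prints the two prompt variants; the unimplemented call_llm stub marks where a real model call would go.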
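Because the 76 included studies used 55 different evaluation metrics, a small shared scoring harness can make results easier to compare. Below is a minimal sketch, in plain Python with no external dependencies, of two common correctness metrics (accuracy and macro-F1) applied to categorical LLM outputs; the gold and predicted labels are invented purely for illustration.

```python
# Minimal sketch: scoring categorical LLM outputs against reference labels
# with accuracy and macro-averaged F1. Labels below are invented examples.

def accuracy(gold, pred):
    # Fraction of predictions that exactly match the reference label.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    # Per-class F1, averaged with equal weight per class, so rare classes
    # (common in clinical data) count as much as frequent ones.
    labels = set(gold) | set(pred)
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["current", "never", "former", "former"]
pred = ["current", "never", "current", "former"]
print(f"accuracy={accuracy(gold, pred):.2f}, macro-F1={macro_f1(gold, pred):.2f}")
```

Macro-F1 is shown alongside accuracy because class imbalance is typical in EHR-derived labels, where accuracy alone can overstate performance.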