Abstract

Objectives

This study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs) for Emergency Medical Services (EMS) quality assurance. Implementing these LLMs for EMS quality assurance has the potential to substantially reduce the workload on medical directors and quality assurance staff by automating aspects of the processing and review of patient care reports. This could enable more efficient and accurate identification of areas requiring improvement, thereby enhancing patient care outcomes.

Methods

Two expert human reviewers, ChatGPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled and anonymized prehospital records from two large urban EMS agencies for adherence to the 2020 National Association of State EMS metrics for cardiac care. We evaluated the accuracy of scoring, inter-rater reliability, and review efficiency. Inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic.

Results

Human reviewers showed high inter-rater reliability, with 91.2% agreement and a kappa coefficient of 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers on EKG documentation and aspirin administration (76.2% agreement, kappa coefficient of 0.401 (0.334-0.468)), but performance varied across other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 1:28 min (IQR 1:12-1:51) per human chart review, 1:24 min (IQR 1:09-1:53) per ChatGPT-4 chart review (p = 0.46), and 1:50 min (IQR 1:10-3:34) per Gemini Ultra review (p = 0.06).

Conclusions

Large language models show potential to support quality assurance by effectively and objectively extracting data elements, but their accuracy in interpreting non-standardized and time-sensitive details remains inferior to that of human evaluators. Our findings suggest that current LLMs may best offer supplemental support to human review processes, though their value remains limited. Enhancements in LLM training and integration are recommended for more reliable performance in quality assurance.
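The Methods describe the LLMs rating anonymized patient care reports for dichotomous adherence to cardiac care metrics; the study itself appears to have used interactive chart-by-chart review. Purely as a hedged sketch of how that extraction step could be scripted, assuming the OpenAI chat completions API, a hypothetical prompt, and placeholder metric wording (none of which come from the paper), it might look like this:

```python
# Illustrative sketch only; NOT the study's protocol. Assumes the OpenAI
# Python SDK (openai>=1.0) and a hypothetical prompt and metric list.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder dichotomous metrics, loosely paraphrasing the two on which
# ChatGPT-4 agreed most closely with human reviewers.
METRICS = [
    "12-lead EKG obtained and documented",
    "Aspirin administered or contraindication documented",
]

def review_chart(chart_text: str) -> dict:
    """Ask the model for a yes/no adherence call on each metric."""
    prompt = (
        "You are an EMS quality assurance reviewer. For each metric below, "
        "answer strictly 'yes' or 'no' based only on the patient care report.\n"
        f"Metrics: {json.dumps(METRICS)}\n\n"
        f"Patient care report:\n{chart_text}\n\n"
        'Respond as JSON: {"metric name": "yes" or "no", ...}'
    )
    resp = client.chat.completions.create(
        model="gpt-4",   # assumed model identifier
        temperature=0,   # favor consistent ratings across charts
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline would need more robust parsing of the reply.
    return json.loads(resp.choices[0].message.content)
```

In practice the returned yes/no calls for each chart would then be compared against the human reviewers' ratings, metric by metric.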
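Inter-rater reliability for each dichotomous metric was measured with the kappa statistic, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the raters' marginal frequencies. A minimal self-contained sketch, using fabricated placeholder ratings rather than study data:

```python
# Minimal Cohen's kappa for two raters and a dichotomous outcome.
# The example ratings are placeholders, not the study's data.
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: proportion of charts where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] / n * freq_b[c] / n
              for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# 1 = metric adherent, 0 = not adherent (placeholder ratings).
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
llm   = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
print(f"kappa = {cohens_kappa(human, llm):.3f}")  # 0.524 for this toy data
```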
