Completeness and readability of GPT-4-generated multilingual discharge instructions in the pediatric emergency department

Alex Gimeno, Colin G Walsh, Kevin Krause, Starina D'Souza

doi:10.1093/jamiaopen/ooae050

Open Access

Journal: JAMIA Open
Publication Date: Jul 1, 2024
License type: CC BY-NC 4.0

Abstract

Objectives: The aim of this study was to assess the completeness and readability of generative pre-trained transformer-4 (GPT-4)-generated discharge instructions at prespecified reading levels for common pediatric emergency room complaints.

Materials and Methods: The outputs for 6 discharge scenarios stratified by reading level (fifth or eighth grade) and language (English, Spanish) were generated fivefold using GPT-4. Specifically, 120 discharge instructions were produced and analyzed (6 scenarios: 60 in English, 60 in Spanish; 60 at a fifth-grade reading level, 60 at an eighth-grade reading level) and compared for completeness and readability (between language, between reading level, and stratified by group and reading level). Completeness was defined as the proportion of literature-derived key points included in discharge instructions. Readability was quantified using Flesch-Kincaid (English) and Fernandez-Huerta (Spanish) readability scores.

Results: English-language GPT-generated discharge instructions contained a significantly higher proportion of must-include discharge instructions than those in Spanish (English: mean (standard error of the mean) = 62% (3%), Spanish: 53% (3%), P = .02). In the fifth-grade and eighth-grade level conditions, there was no significant difference between English and Spanish outputs in completeness. Readability did not differ across languages.

Discussion: GPT-4 produced readable discharge instructions in English and Spanish while modulating document reading level. Discharge instructions in English tended to have higher completeness than those in Spanish.

Conclusion: Future research in prompt engineering and GPT-4 performance, both generally and in multiple languages, is needed to reduce potential for health disparities by language and reading level.
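The design arithmetic and the two outcome measures described in the Materials and Methods can be illustrated with a minimal Python sketch. It is not the authors' code. The scenario labels are hypothetical placeholders (the abstract does not name the 6 scenarios), the function names are invented for illustration, and the readability function uses the standard published Flesch-Kincaid grade-level formula on precomputed counts, which is an assumption because the abstract does not state which Flesch-Kincaid variant was used. The Fernandez-Huerta score for Spanish is not sketched here.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract does not name the 6 scenarios.
SCENARIOS = [f"scenario_{i}" for i in range(1, 7)]
LANGUAGES = ["English", "Spanish"]
READING_LEVELS = ["fifth grade", "eighth grade"]
REPLICATES = range(1, 6)  # each condition was generated fivefold


def build_design():
    """Enumerate every (scenario, language, reading level, replicate) cell."""
    return list(product(SCENARIOS, LANGUAGES, READING_LEVELS, REPLICATES))


def completeness(included_points, key_points):
    """Proportion of the key points that appear among the included points."""
    key = set(key_points)
    if not key:
        raise ValueError("key_points must not be empty")
    return len(key & set(included_points)) / len(key)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid grade level from raw counts (assumed variant)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


design = build_design()
print(len(design))  # 6 scenarios x 2 languages x 2 levels x 5 replicates
```

Filtering `design` by language or by reading level yields the 60/60 splits reported in the abstract, and `completeness` returns a value between 0 and 1 that can be averaged within each group.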