The promise and peril of using a large language model to obtain clinical information: ChatGPT performs strongly as a fertility counseling tool with limitations

Joseph Chervenak, Harry Lieman, Miranda Blanco-Breindel, Sangita Jindal

doi:10.1016/j.fertnstert.2023.05.151

Open Access

Journal: Fertility and Sterility	Publication Date: May 20, 2023
Citations: 25	License type: cc-by

Affiliation: Albert Einstein College of Medicine

Abstract

OBJECTIVE: To compare the responses of the large language model-based "ChatGPT" with those of reputable sources when given fertility-related clinical prompts.

DESIGN: The "Feb 13" version of ChatGPT by OpenAI was tested against established sources of patient-oriented clinical information: 17 "Frequently Asked Questions" (FAQs) about infertility on the Centers for Disease Control and Prevention (CDC) website; 2 validated fertility knowledge surveys, the Cardiff Fertility Knowledge Scale (CFKS) and the Fertility and Infertility Treatment Knowledge Score (FIT-KS); and the American Society for Reproductive Medicine (ASRM) Committee Opinion "Optimizing Natural Fertility."

SUBJECT: ChatGPT online chatbot.

INTERVENTION OR EXPOSURE: FAQs, survey questions, and rephrased summary statements were entered as prompts in the chatbot over a 1-week period in February 2023.

MAIN OUTCOME MEASURES: For the CDC FAQs: words per response; sentiment analysis polarity and subjectivity; total factual statements; and the rate of statements that were incorrect, referenced a source, or noted the value of consulting a provider. For the fertility knowledge surveys: percentile according to published population data. For the Committee Opinion: whether responses to conclusions rephrased as questions identified the missing fact.

RESULTS: When administered the CDC's 17 infertility FAQs, ChatGPT produced responses similar to the CDC's in length (207.8 vs. 181.0 words/response, p = NS), factual content (8.65 vs. 10.41 factual statements/response, p = NS), sentiment polarity (mean 0.11 vs. 0.11 on a scale of -1 (negative) to 1 (positive), p = NS), and subjectivity (mean 0.42 vs. 0.35 on a scale of 0 (objective) to 1 (subjective), p = NS). Of ChatGPT's 147 factual statements, 9 (6.12%) were categorized as incorrect and only 1 (0.68%) cited a reference. ChatGPT would have scored at the 87th percentile of Bunting's 2013 international cohort for the CFKS and at the 95th percentile of Kudesia's 2017 cohort for the FIT-KS. ChatGPT reproduced the missing facts for all 7 summary statements from "Optimizing Natural Fertility."

CONCLUSIONS: A February 2023 version of ChatGPT demonstrates the ability of generative artificial intelligence (AI) to produce relevant, meaningful responses to fertility-related clinical queries comparable to established sources. Although performance may improve with medical domain-specific training, limitations such as the inability to reliably cite sources and the unpredictable possibility of fabricated information may limit its clinical utility.
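
The rates reported for ChatGPT's factual statements can be cross-checked against the counts given in the abstract. The short Python sketch below is only an illustrative consistency check, not part of the authors' analysis; the variable names and the `pct` helper are assumptions introduced here, and every input figure is copied from the abstract.

```python
# Figures quoted in the abstract (inputs only; nothing here is new data).
n_faqs = 17                        # CDC infertility FAQs used as prompts
statements_per_response = 8.65     # mean ChatGPT factual statements per response
total_statements = 147             # total ChatGPT factual statements
incorrect = 9                      # statements categorized as incorrect
cited = 1                          # statements that cited a reference


def pct(count, total):
    """Hypothetical helper: percentage rounded to two decimals."""
    return round(100 * count / total, 2)


# 17 responses x 8.65 statements/response should recover the stated total.
implied_total = round(n_faqs * statements_per_response)
incorrect_pct = pct(incorrect, total_statements)
cited_pct = pct(cited, total_statements)

print(implied_total, incorrect_pct, cited_pct)
```

The recomputed values agree with the reported 147 statements, 6.12% incorrect, and 0.68% citing a reference.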
