Abstract

INTRODUCTION: Large language models (LLMs) have the potential to revolutionize reproductive health education by offering accessible contraception counseling. The CDC, NIH, and AMA recommend presenting patient education materials at or below a sixth-grade reading level. In this study, we present a comparative analysis of the readability of responses generated by four LLMs, OpenAI's ChatGPT 4.0 and ChatGPT 3.5, Google Bard, and Microsoft Bing, to common contraception questions.

METHODS: Referencing common contraception questions outlined in a recent review, we presented GPT-4.0, GPT-3.5, Google Bard, and Microsoft Bing with six common contraception questions on June 10, 2023, using fresh chat sessions to avoid bias from prior conversation history. The readability of each LLM's output was evaluated with the Flesch–Kincaid Grade Level (FK), Gunning Fog (GF), Automated Readability Index (ARI), and Coleman–Liau (CL) indices. Each index score corresponds to a reading grade level (RGL); e.g., an RGL of 6 represents a sixth-grade level.

RESULTS: Of the four LLMs, Google Bard produced the most readable responses, with the lowest average RGL of 10.6 (GF 10.1, FK 9.6, ARI 11.4, CL 11.4), followed by GPT-3.5 with an average RGL of 13.6 (GF 13.3, FK 12.6, ARI 15.2, CL 16.5). Microsoft Bing and GPT-4.0 yielded responses with higher average RGLs of 14.2 and 15.4, respectively.

CONCLUSION: Google Bard and GPT-3.5 offered the most readable responses to contraception questions; however, both models still responded well above the recommended sixth-grade reading level (RGL of 6). Large language models should be evaluated against established readability standards to optimize patient comprehension in reproductive counseling applications.
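The grade-level indices used in the methods are simple functions of word, sentence, and character (or syllable) counts. As an illustration only, and not the tooling the authors used, the two character-based indices (ARI and Coleman–Liau) can be sketched with their standard published formulas:

```python
import re

def readability_indices(text):
    """Estimate ARI and Coleman-Liau grade levels from raw text.

    Illustrative sketch using the standard published formulas; the
    syllable-based indices (Flesch-Kincaid, Gunning Fog) require a
    syllable counter and are omitted here.
    """
    # Naive sentence and word tokenization; real tools are more careful.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    letters = sum(len(w) for w in words)
    n_words, n_sents = len(words), len(sentences)

    # Automated Readability Index
    ari = 4.71 * (letters / n_words) + 0.5 * (n_words / n_sents) - 21.43

    # Coleman-Liau: L = letters per 100 words, S = sentences per 100 words
    L = letters / n_words * 100
    S = n_sents / n_words * 100
    cl = 0.0588 * L - 0.296 * S - 15.8

    return round(ari, 1), round(cl, 1)
```

Both indices scale with word length and sentence length, which is why short, simple sentences can score below grade 1 (even negative), while the long compound sentences typical of LLM output push scores into the double digits seen in the results.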
