Evaluating the performance of five large language models in generating patient educational content for pediatric cardiothoracic procedures: a comparative study.

Abstract

This study aims to evaluate the efficacy of Large Language Models (LLMs) in generating patient educational content on pediatric cardiothoracic surgical procedures. In this comparative observational study we employed five LLMs (ChatGPT 4o, ChatGPT 4, Google Gemini, Perplexity AI, and Claude AI) to create educational pamphlets for 24 different pediatric cardiothoracic procedures. Each LLM produced three pamphlets per procedure, for a total of 360 unique pamphlets. These pamphlets were evaluated for accuracy, consistency, and relevance using structured scoring scales. Five reviewers were employed, resulting in 1800 evaluations for accuracy and 600 for consistency. Patient advocates independently reviewed relevance. Readability was assessed using six different metrics. The study revealed significant differences in accuracy, with Perplexity AI performing best in cardiac procedures (p < 0.00001) and Claude AI excelling in pulmonary procedures (p = 0.001). Consistency varied significantly across models, with ChatGPT 4 showing high variability across pamphlets. Readability analysis indicated that Gemini produced the most comprehensible content. The overall relevance of the pamphlets was highest with Perplexity AI (p < 0.00001). Post-hoc analysis revealed that, overall, ChatGPT 4 and Perplexity AI tend to have similar levels of readability across the measures. LLMs demonstrate significant potential in creating educational materials for pediatric cardiothoracic surgery. However, our findings suggest that their effectiveness varies by procedure type and evaluation criterion. Tailoring LLM-generated content to specific contexts, along with physician oversight, is critical. Additionally, readability should be optimized to ensure adequate comprehension by the general public.
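
The abstract does not name the six readability metrics used; as a minimal, self-contained sketch (assuming standard formulas such as Flesch-Kincaid Grade Level, Flesch Reading Ease, and Gunning Fog, plus a naive vowel-group syllable counter), grade-level scoring of a pamphlet might look like this:

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (a rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    """Compute three standard published readability formulas for a passage."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wps = n / sentences        # mean words per sentence
    spw = syllables / n        # mean syllables per word
    return {
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "gunning_fog": 0.4 * (wps + 100 * complex_words / n),
    }

sample = ("The surgeon will close the small hole in your child's heart. "
          "Most children go home within a week.")
print(readability(sample))
```

Lower grade-level scores (around sixth grade) are the usual target for patient-facing materials, which is why readability optimization matters here.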

Similar Papers
  • Research Article
  • Citations: 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more

  • Research Article
  • Citations: 1
  • 10.2196/56126
Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study.
  • Feb 5, 2025
  • JMIR formative research
  • Nicola Luigi Bragazzi + 7 more

The COVID-19 pandemic has significantly strained health care systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an "infodemic" of misinformation, particularly prevalent in women's health, has emerged. This challenge has been pivotal for health care providers, especially gynecologists and obstetricians, in managing pregnant women's health. The pandemic heightened the risks pregnant women face from COVID-19, necessitating balanced advice from specialists on vaccine safety versus known risks. In addition, the advent of generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in health care; however, these tools necessitate rigorous testing. This study aimed to assess LLMs' proficiency, clarity, and objectivity regarding COVID-19's impacts on pregnancy. The study evaluated 4 major AI prototypes (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts in a questionnaire validated among 159 Israeli gynecologists and obstetricians. The questionnaire assesses proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text mining, sentiment analysis, and readability analyses (Flesch-Kincaid Grade Level and Flesch Reading Ease Score) were also conducted. In terms of LLMs' knowledge, ChatGPT-4 and Microsoft Copilot each scored 97% (32/33), Google Bard 94% (31/33), and ChatGPT-3.5 82% (27/33). ChatGPT-4 incorrectly stated an increased risk of miscarriage due to COVID-19. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, Microsoft Copilot achieved the least negative score (-4), followed by ChatGPT-4 (-6) and Google Bard (-7), while ChatGPT-3.5 obtained the most negative score (-12). Finally, concerning the readability analysis, Flesch-Kincaid Grade Level and Flesch Reading Ease Score showed that Microsoft Copilot was the most accessible at 9.9 and 49, followed by ChatGPT-4 at 12.4 and 37.1, while ChatGPT-3.5 (12.9 and 35.6) and Google Bard (12.9 and 35.8) generated particularly complex responses. The study highlights varying knowledge levels of LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and alignment with scientific evidence. Readability and complexity analyses suggest that each AI's approach was tailored to specific audiences, with ChatGPT versions being more suitable for specialized readers and Microsoft Copilot for the general public. Sentiment analysis revealed notable variations in the way LLMs communicated critical information, underscoring the essential role of neutral and objective health care communication in ensuring that pregnant women, particularly vulnerable during the COVID-19 pandemic, receive accurate and reassuring guidance. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provided accurate, updated information on COVID-19 and vaccines in maternal and fetal health, aligning with health guidelines. The study demonstrated the potential role of AI in supplementing health care knowledge, along with the need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and the required level of detail.
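
The sentiment scores reported above come from an unspecified method; a minimal sketch of per-response sentiment scoring, assuming NLTK's VADER analyzer as an off-the-shelf stand-in (an assumption, not the study's actual pipeline):

```python
# Sentiment-scoring sketch; VADER is an assumed stand-in for the
# study's unspecified sentiment-analysis method.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon fetch

analyzer = SentimentIntensityAnalyzer()
responses = {  # hypothetical model outputs
    "copilot": "Vaccination in pregnancy is safe and strongly recommended.",
    "gpt35": "Infection during pregnancy can cause severe complications.",
}
for model, text in responses.items():
    # 'compound' ranges from -1 (most negative) to +1 (most positive)
    print(model, analyzer.polarity_scores(text)["compound"])
```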

  • Research Article
  • Citations: 2
  • 10.3748/wjg.v31.i3.101092
Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study
  • Jan 21, 2025
  • World Journal of Gastroenterology
  • Yu Li + 5 more

BACKGROUND: Patients with hepatitis B virus (HBV) infection require chronic and personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients. AIM: To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions. METHODS: LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was run three times in each of the three LLMs. Readability was assessed via the Gunning Fog index and Flesch-Kincaid grade level. RESULTS: Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). With respect to objective questions, ChatGPT-4.0 achieved an 80.8% accuracy rate, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed better in diagnosis, whereas Google Gemini performed excellently in clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of the three LLM chatbots were significantly higher than the standard eighth-grade level, far exceeding the reading level of the general population. CONCLUSION: Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may be an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
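
As a toy illustration of the grading workflow described here (two independent graders on a four-point accuracy scale, with a third reviewer resolving disagreements), using entirely hypothetical scores:

```python
def adjudicate(grader1: int, grader2: int, reviewer3: int) -> int:
    """Keep the score when the two graders agree; otherwise defer to reviewer 3."""
    return grader1 if grader1 == grader2 else reviewer3

# (grader1, grader2, reviewer3) per question, on a 1-4 scale (hypothetical)
grades = [(4, 4, 4), (3, 4, 4), (4, 2, 3), (4, 4, 4)]
final = [adjudicate(*g) for g in grades]
print(sum(final) / len(final))  # mean accuracy score out of 4.0
```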

  • Research Article
  • 10.2196/70703
AI in Home Care-Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study.
  • Apr 28, 2025
  • Journal of medical Internet research
  • Clara Pérez-Esteve + 6 more

The aging population represents an achievement for society but also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including in home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. However, the advent of digital resources has introduced videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs exhibit the capacity to mimic clinical reasoning and support decision-making, their potential to serve as alternatives to evidence-based professional instruction remains unexplored. We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard, and to identify specific domains where LLMs show the most promise and where improvements are necessary to optimize their reliability for caregiver training. An observational, comparative case study evaluated 3 LLMs (GPT-3.5, GPT-4o, and Microsoft Copilot) in 10 home care scenarios. A rubric assessed the models against a reference standard (gold standard) created by health care professionals. Independent reviewers evaluated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were also compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity, as well as to analyze differences between LLMs across all evaluated domains. The study revealed that while no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). However, the models exhibited significant limitations, with GPT-4o and Copilot omitting relevant details in 60% (6/10) of the cases, and GPT-3.5 doing so in 80% (8/10). When compared to the gold standard, only 10% (2/20) of GPT-4o responses were rated as equally specific, 20% (4/20) included comparable practical advice, and just 5% (1/20) provided a justification as detailed as professional guidance. Furthermore, error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20%, 2/10 vs 10%, 1/10 for GPT-4o and 0%, 0/10 for GPT-3.5). LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although not yet surpassing professional instruction quality, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address limitations and optimize their performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors.
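
The abstract does not name the test behind P=.65; one plausible reconstruction of the error-frequency comparison is a contingency-table test, sketched here with SciPy (with counts this small, an exact test would be preferable):

```python
# Illustrative comparison of incorrect-information counts across models;
# the study's actual statistical test is not stated in the abstract.
from scipy.stats import chi2_contingency

#           Copilot  GPT-4o  GPT-3.5
errors  = [2,       1,      0]
correct = [8,       9,      10]   # out of 10 scenarios each
chi2, p, dof, expected = chi2_contingency([errors, correct])
print(f"chi2={chi2:.2f}, p={p:.2f}")  # expected cell counts are low here
```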

  • Research Article
  • Citations: 1
  • 10.1152/advan.00209.2024
Transforming medical education: leveraging large language models to enhance PBL-a proof-of-concept study.
  • Jun 1, 2025
  • Advances in physiology education
  • Shoukat Ali Arain + 3 more

The alignment of learning materials with learning objectives (LOs) is critical for successfully implementing the problem-based learning (PBL) curriculum. This study investigated the capabilities of Gemini Advanced, a large language model (LLM), in creating clinical vignettes that align with LOs and comprehensive tutor guides. This study used a faculty-written clinical vignette about diabetes mellitus for third-year medical students. We submitted the LOs and the associated clinical vignette and tutor guide to the LLM to evaluate their alignment and generate new versions. Four faculty members compared both versions, using a structured questionnaire. The mean evaluation scores for original and LLM-generated versions are reported. The LLM identified new triggers for the clinical vignette to align it better with the LOs. Moreover, it restructured the tutor guide for better organization and flow and included thought-provoking questions. The medical information provided by the LLM was scientifically appropriate and accurate. The LLM-generated clinical vignette scored higher (3.0 vs. 1.25) for alignment with the LOs. However, the original version scored better for being educational level-appropriate (2.25 vs. 1.25) and adhering to PBL design (2.50 vs. 1.25). The LLM-generated tutor guide scored higher for better flow (3.0 vs. 1.25), comprehensive and relevant content (2.75 vs. 1.50), and thought-provoking questions (2.25 vs. 1.75). However, LLM-generated learning material lacked visual elements. In conclusion, this study demonstrated that Gemini could align and improve PBL learning materials. By leveraging the potential of LLMs while acknowledging their limitations, medical educators can create innovative and effective learning experiences for future physicians.NEW & NOTEWORTHY This study evaluated a large language model (LLM) (Gemini Advanced) for creating aligned problem-based learning (PBL) materials. The LLM improved the alignment of the clinical vignette with learning goals. The LLM also restructured the tutor guide and added thought-provoking questions. The LLM guide was well organized and informative, but the original vignette was considered more educational level-appropriate. Although the LLM could not generate visuals, AI can improve PBL materials, especially when combined with human expertise.

  • Research Article
  • 10.1177/15563316251340697
Competencies of Large Language Models About Piriformis Syndrome: Quality, Accuracy, Completeness, and Readability Study
  • May 20, 2025
  • HSS Journal®: The Musculoskeletal Journal of Hospital for Special Surgery
  • Burak Tayyip Dede + 4 more

Background: The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs). Purpose: We sought to characterize LLM responses to questions about piriformis syndrome (PS). Methods: On August 15, 2024, we asked 3 LLMs (ChatGPT-4, Copilot, and Gemini) to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses according to the Likert scale. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores. Results: The mean completeness scores of the responses obtained from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively. There was a significant difference in the mean completeness score among LLMs. In pairwise comparisons, ChatGPT and Gemini were superior to Copilot. There was no significant difference between the LLMs in terms of mean accuracy scores. In readability analyses, no significant difference was found in terms of FKRE scores. However, a significant difference was found in FKGL scores. A significant difference between LLMs was identified in the quality analysis performed according to EQIP scores. Conclusion: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need to be improved to perform better in terms of accuracy, completeness, quality, and readability on PS for a general audience.

  • Research Article
  • 10.1167/tvst.14.8.19
Enhancing the Readability of Online Pediatric Cataract Education Materials: A Comparative Study of Large Language Models.
  • Aug 1, 2025
  • Translational vision science & technology
  • Xinyi Qiu + 5 more

The purpose of this study was to assess large language models (LLMs) for enhancing the readability of online patient education materials (PEMs) on pediatric cataracts through multilingual adaptation, content retrieval, and prompt engineering. This study included 103 PEMs presented in different languages and retrieved from diverse resources. Three LLMs (ChatGPT-4o, Gemini 2.0, and DeepSeek-R1) were used for content improvement. Readability was assessed for both the original and converted PEMs with multiple formulas. Different prompt engineering strategies for the LLMs were also tested. The PEMs directly generated by LLMs exceeded a 10th grade reading level. Compared to a traditional Google search, the LLMs' web browsing feature provided online PEMs with better characteristics and a higher reading level. Original PEMs from Google showed significantly improved readability after LLM conversion, with DeepSeek-R1 achieving the greatest reduction in reading level, from 10.59 ± 2.20 to 7.01 ± 0.91 (P < 0.001). Prompt engineering strategies also had statistically significant effects on LLM conversion, with Zero-shot-CoT (APE) achieving the target readability below the sixth grade reading level. In addition, both the LLMs' simplified Chinese conversions and their conversions of other originally Chinese PEMs met the recommended reading-level standards in multiple dimensions. LLMs can significantly enhance the readability of multilingual online PEMs on pediatric cataracts. Combining LLM conversion with web browsing and prompt engineering can further optimize outcomes and advance patient education. This study links LLMs with patient education and demonstrates their potential to significantly improve the readability of online PEMs.
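
A minimal sketch of a readability-conversion prompt in the Zero-shot-CoT (APE) style mentioned above; the study's exact prompt wording is not given, so the template and target grade below are assumptions:

```python
# The APE-discovered zero-shot chain-of-thought suffix (Zhou et al.),
# used here as an assumed stand-in for the study's actual prompt.
APE_COT = ("Let's work this out in a step by step way "
           "to be sure we have the right answer.")

def build_prompt(pem_text: str, target_grade: int = 6) -> str:
    """Wrap a patient education passage in a readability-conversion prompt."""
    return (f"Rewrite the following patient education material on pediatric "
            f"cataracts at or below a grade-{target_grade} reading level, "
            f"keeping all medical content accurate.\n\n"
            f"{pem_text}\n\n{APE_COT}")

print(build_prompt("A cataract is a clouding of the eye's natural lens..."))
```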

  • Research Article
  • Citations: 1
  • 10.1111/eje.13073
Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination.
  • Jan 31, 2025
  • European journal of dental education : official journal of the Association for Dental Education in Europe
  • Yu-Tao Xiong + 6 more

This study aimed to simulate diverse scenarios of students employing LLMs to prepare for the Chinese Dental Licensing Examination (CDLE), providing a detailed evaluation of their performance in medical education. A stratified random sampling strategy was implemented to select and subsequently revise 200 questions from the CDLE. Seven LLMs, recognised for their exceptional performance in the Chinese domain, were selected as test subjects. Three distinct testing scenarios were constructed: answering questions, explaining questions and adversarial testing. The evaluation metrics included accuracy, agreement rate and teaching effectiveness score. Wald χ2 tests and Kruskal-Wallis tests were employed to determine whether the differences among the LLMs across various scenarios, and before and after adversarial testing, were statistically significant. The majority of the tested LLMs met the passing threshold on the CDLE benchmark, with Doubao-pro 32k and Qwen2-72b (81%) achieving the highest accuracy rates. Doubao-pro 32k demonstrated the highest agreement rate with the reference answers (98%) when providing explanations. Although statistically significant differences existed among the LLMs in their teaching effectiveness scores based on the Likert scale, all models demonstrated a commendable ability to deliver comprehensible and effective instructional content. In adversarial testing, GPT-4 exhibited the smallest decline in accuracy (2%, p = 0.623), while ChatGLM-4 demonstrated the least reduction in agreement rate (14.6%, p = 0.001). LLMs trained on Chinese corpora, such as Doubao-pro 32k, demonstrated numerically superior performance compared to GPT-4 in answering and explaining questions, although the difference was not statistically significant. However, during adversarial testing, all models exhibited diminished performance, with GPT-4 displaying comparatively greater robustness. Future research should further investigate the interpretability of LLM outputs and develop strategies to mitigate hallucinations generated in medical education.
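
The Kruskal-Wallis comparison described above can be reproduced in a few lines; the sketch below uses hypothetical 5-point teaching-effectiveness ratings for three of the models:

```python
# Kruskal-Wallis test across models; all ratings are hypothetical.
from scipy.stats import kruskal

doubao_pro = [5, 4, 5, 5, 4, 5]
qwen2_72b  = [4, 4, 5, 4, 4, 5]
gpt4       = [4, 3, 4, 4, 5, 4]
stat, p = kruskal(doubao_pro, qwen2_72b, gpt4)
print(f"H = {stat:.2f}, p = {p:.3f}")  # p < 0.05 suggests ratings differ
```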

  • Research Article
  • Citations: 19
  • 10.1007/s00464-024-10720-2
Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources
  • Mar 12, 2024
  • Surgical Endoscopy
  • Nitin Srinivasan + 5 more

Background: The readability of online bariatric surgery patient education materials (PEMs) often surpasses the recommended 6th grade level. Large language models (LLMs), like ChatGPT and Bard, have the potential to revolutionize PEM delivery. We aimed to evaluate the readability of PEMs produced by U.S. medical institutions compared to LLMs, as well as the ability of LLMs to simplify their responses. Methods: Responses to frequently asked questions (FAQs) related to bariatric surgery were gathered from top-ranked health institutions. FAQ responses were also generated from GPT-3.5, GPT-4, and Bard. LLMs were then prompted to improve the readability of their initial responses. The readability of institutional responses, initial LLM responses, and simplified LLM responses was graded using validated readability formulas. Accuracy and comprehensiveness of initial and simplified LLM responses were also compared. Results: Responses to 66 FAQs were included. All institutional and initial LLM responses had poor readability, with average reading levels ranging from 9th grade to college graduate. Simplified responses from LLMs had significantly improved readability, with reading levels ranging from 6th grade to college freshman. When comparing simplified LLM responses, GPT-4 responses demonstrated the highest readability, with reading levels ranging from 6th to 9th grade. Accuracy was similar between initial and simplified responses from all LLMs. Comprehensiveness was similar between initial and simplified responses from GPT-3.5 and GPT-4. However, 34.8% of Bard's simplified responses were graded as less comprehensive than its initial responses. Conclusion: Our study highlights the efficacy of LLMs in enhancing the readability of bariatric surgery PEMs. GPT-4 outperformed other models, generating simplified PEMs from 6th to 9th grade reading levels. Unlike GPT-3.5 and GPT-4, Bard's simplified responses were graded as less comprehensive. We advocate for future studies examining the potential role of LLMs as dynamic and personalized sources of PEMs for diverse patient populations of all literacy levels.

  • Research Article
  • Citations: 5
  • 10.1016/j.heliyon.2024.e34391
Benchmarking four large language models’ performance of addressing Chinese patients' inquiries about dry eye disease: A two-phase study
  • Jul 1, 2024
  • Heliyon
  • Runhan Shi + 13 more

  • Research Article
  • 10.1016/j.jpedsurg.2025.162654
Designing patient-centered communication aids in pediatric surgery using large language models.
  • Sep 8, 2025
  • Journal of pediatric surgery
  • Arya S Rao + 10 more

  • Research Article
  • 10.3390/dj13060271
Evaluation of Large Language Model Performance in Answering Clinical Questions on Periodontal Furcation Defect Management.
  • Jun 18, 2025
  • Dentistry journal
  • Georgios S Chatzopoulos + 3 more

Background/Objectives: Large Language Models (LLMs) are artificial intelligence (AI) systems with the capacity to process vast amounts of text and generate human-like language, offering the potential for improved information retrieval in healthcare. This study aimed to assess and compare the evidence-based potential of answers provided by four LLMs to common clinical questions concerning the management and treatment of periodontal furcation defects. Methods: Four LLMs (ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot) were used to answer ten clinical questions related to periodontal furcation defects. The LLM-generated responses were compared against a "gold standard" derived from the European Federation of Periodontology (EFP) S3 guidelines and recent systematic reviews. Two board-certified periodontists independently evaluated the answers for comprehensiveness, scientific accuracy, clarity, and relevance using a predefined rubric and a scoring system of 0-10. Results: The study found variability in LLM performance across the evaluation criteria. Google Gemini Advanced generally achieved the highest average scores, particularly in comprehensiveness and clarity, while Google Gemini and Microsoft Copilot tended to score lower, especially in relevance. However, the Kruskal-Wallis test revealed no statistically significant differences in the overall average scores among the LLMs. Evaluator agreement and intra-evaluator reliability were high. Conclusions: While LLMs demonstrate the potential to answer clinical questions related to furcation defect management, their performance varies, with differing degrees of comprehensiveness, scientific accuracy, clarity, and relevance. Dental professionals should be aware of LLMs' capabilities and limitations when seeking clinical information.

  • Research Article
  • Citations: 6
  • 10.1093/sexmed/qfae055
Prompt matters: evaluation of large language model chatbot responses related to Peyronie's disease.
  • Aug 13, 2024
  • Sexual medicine
  • Christopher J Warren + 7 more

Despite direct access to clinicians through the electronic health record, patients are increasingly turning to the internet for information related to their health, especially with sensitive urologic conditions such as Peyronie's disease (PD). Large language model (LLM) chatbots are a form of artificial intelligence that rely on user prompts to mimic conversation, and they have shown remarkable capabilities. The conversational nature of these chatbots has the potential to answer patient questions related to PD; however, the accuracy, comprehensiveness, and readability of these LLMs related to PD remain unknown. We aimed to assess the quality and readability of information generated by 4 LLMs in response to searches related to PD; to see if users could improve responses through prompting; and to assess the accuracy, completeness, and readability of responses to artificial preoperative patient questions sent through the electronic health record prior to PD surgery. The National Institutes of Health's frequently asked questions related to PD were entered into 4 LLMs, unprompted and prompted. The responses were evaluated for overall quality by the previously validated DISCERN questionnaire. Accuracy and completeness of LLM responses to 11 presurgical patient messages were evaluated with previously accepted Likert scales. All evaluations were performed by 3 independent reviewers in October 2023, and all reviews were repeated in April 2024. Descriptive statistics and analyses were performed. Without prompting, the quality of information was moderate across all LLMs but improved to high quality with prompting. LLMs were accurate and complete, with average scores of 5.5 of 6.0 (SD, 0.8) and 2.8 of 3.0 (SD, 0.4), respectively. The average Flesch-Kincaid reading level was grade 12.9 (SD, 2.1). Chatbots were unable to communicate at a grade 8 reading level even when prompted, and their citations were appropriate only 42.5% of the time. LLMs may become a valuable tool for patient education for PD, but they currently rely on clinical context and appropriate prompting by humans to be useful. Unfortunately, their prerequisite reading level remains higher than that of the average patient, and their citations cannot be trusted. However, given their increasing uptake and accessibility, patients and physicians should be educated on how to interact with these LLMs to elicit the most appropriate responses. In the future, LLMs may reduce burnout by helping physicians respond to patient messages.

  • Research Article
  • 10.2196/69955
Enhancing the Readability of Online Patient Education Materials Using Large Language Models: Cross-Sectional Study
  • Jun 4, 2025
  • Journal of Medical Internet Research
  • John Will + 5 more

Background: Online accessible patient education materials (PEMs) are essential for patient empowerment. However, studies have shown that these materials often exceed the recommended sixth-grade reading level, making them difficult for many patients to understand. Large language models (LLMs) have the potential to simplify PEMs into more readable educational content. Objective: We sought to evaluate whether 3 LLMs (ChatGPT [OpenAI], Gemini [Google], and Claude [Anthropic PBC]) can optimize the readability of PEMs to the recommended reading level without compromising accuracy. Methods: This cross-sectional study used 60 randomly selected PEMs available online from 3 websites. We prompted LLMs to simplify the reading level of online PEMs. The primary outcome was the readability of the original online PEMs compared with the LLM-simplified versions. Readability scores were calculated using 4 validated indices: Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, and Simple Measure of Gobbledygook Index. Accuracy and understandability were also assessed as balancing measures, with understandability measured using the Patient Education Materials Assessment Tool-Understandability (PEMAT-U). Results: The original readability scores for the American Heart Association (AHA), American Cancer Society (ACS), and American Stroke Association (ASA) websites were above the recommended sixth-grade level, with mean grade level scores of 10.7, 10.0, and 9.6, respectively. After optimization by the LLMs, readability scores significantly improved across all 3 websites compared with the original text. The Wilcoxon signed rank test showed that ChatGPT improved readability from 10.1 to 7.6 (P<.001); Gemini, to 6.6 (P<.001); and Claude, to 5.6 (P<.001). Word counts were significantly reduced by all LLMs, with a decrease from a mean range of 410.9-953.9 words to a mean range of 201.9-248.1 words. None of the ChatGPT-simplified PEMs were inaccurate, while 3.3% of Gemini- and Claude-simplified PEMs were inaccurate. Baseline understandability scores, as measured by PEMAT-U, were preserved across all LLM-simplified versions. Conclusions: This cross-sectional study demonstrates that LLMs have the potential to significantly enhance the readability of online PEMs while maintaining accuracy and understandability, making them more accessible to a broader audience. However, variability in model performance and demonstrated inaccuracies underscore the need for human review of LLM output. Further study is needed to explore advanced LLM techniques and models trained for medical content.
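
The paired before/after readability comparison reported above maps directly onto a Wilcoxon signed rank test; a sketch with hypothetical grade-level pairs:

```python
# Wilcoxon signed rank test on paired readability scores
# (original vs. LLM-simplified); the values are hypothetical.
from scipy.stats import wilcoxon

original   = [10.1, 10.7, 9.6, 10.0, 11.2, 9.8]
simplified = [7.6, 6.9, 6.2, 7.1, 7.8, 6.5]
stat, p = wilcoxon(original, simplified)
print(f"W = {stat:.1f}, p = {p:.4f}")  # small p -> significant grade-level drop
```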

  • Research Article
  • 10.1161/circ.152.suppl_3.4369507
Abstract 4369507: Large Language Models for Patient Education for Atrial Fibrillation
  • Nov 4, 2025
  • Circulation
  • Hrishi Paliath-Pathiyal + 10 more

Background: Large language models (LLMs) are used by patients seeking information about atrial fibrillation. More than 1 billion monthly users use 4 common LLMs: ChatGPT, Gemini, Claude.ai, and Meta AI. It is not known, however, how LLM responses to atrial fibrillation inquiries differ by patient gender and ethnic group/race. Methods: The following query was posed to these 4 LLMs: "I am a 68-year-old [racial/ethnic group and gender] with atrial fibrillation. I had a heart attack 2 years ago with coronary artery stents. What can I expect from my cardiologist?" Three ethnic/racial groups (White, African American, and Latinx) and male/female gender were studied. Responses were analyzed for word count, Flesch-Kincaid Grade Level (FK), and cosine similarity score, and ChatGPT 4.5 was used to rate cultural sensitivity. Results: Average word counts: ChatGPT = 312.5, Gemini = 937.7, Claude.ai = 262.5, Meta AI = 240 (mean 438.2 ± 304.3). FK scores: ChatGPT = 10.7, Gemini = 13.3, Claude.ai = 30.7, Meta AI = 12.4 (mean 16.8 ± 8.5). Meta AI generated the least culturally sensitive (CS) content across all demographic prompts. Word count analysis showed Meta AI and Claude.ai with the shortest responses and Gemini the longest. Cosine similarity scores ranged from 71.7% to 78.2% (1.00 = perfect agreement; mean 74.5 ± 3.0). Readability analysis showed Claude.ai's responses demanded the highest health literacy (beyond college level), while ChatGPT's were the most accessible (10th-grade level). ChatGPT and Gemini mentioned CHA2DS2-VASc scores. All LLMs mentioned anticoagulation and antiarrhythmic medications. None mentioned catheter ablation. Of the 4 LLMs, Meta AI was the least likely to mention systemic barriers/social determinants of health relevant to African American or Latinx patients. All except ChatGPT included cultural sensitivity and health issues for Black women. No LLMs included cultural issues for White women. Conclusion: The four LLMs differ in their responses to queries about atrial fibrillation. As LLMs evolve, it will be important to consider these variations to understand their strengths and limitations.
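
The cosine similarity scores above compare response texts pairwise; one simple way to compute such a score, assuming TF-IDF vectors (the study's actual text representation is not stated), is:

```python
# Pairwise cosine similarity between two LLM responses via TF-IDF;
# the study's exact vectorization method is an assumption here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

responses = [  # hypothetical response snippets
    "Your cardiologist will discuss anticoagulation and rate control.",
    "Expect a discussion of blood thinners and heart-rate medicines.",
]
tfidf = TfidfVectorizer().fit_transform(responses)
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])  # 1.0 = identical wording
```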

More from: General thoracic and cardiovascular surgery
  • Research Article
  • 10.1007/s11748-025-02205-3
New risk model for prognostic prediction after surgical aortic valve replacement in hemodialysis patients.
  • Nov 4, 2025
  • General thoracic and cardiovascular surgery
  • Shohei Yamada + 8 more

  • Research Article
  • 10.1007/s11748-025-02217-z
Comparison of thoracotomy conversion rates and causes between VATS and RATS for primary lung cancer: a retrospective cohort study.
  • Nov 4, 2025
  • General thoracic and cardiovascular surgery
  • Yasuaki Kubouchi + 6 more

  • Research Article
  • 10.1007/s11748-025-02219-x
Efficacy of total arch replacement with frozen elephant trunk for type B aortic dissection involving left subclavian artery-adjacent entry: a strategy for anatomically challenging cases.
  • Nov 3, 2025
  • General thoracic and cardiovascular surgery
  • Norimasa Haijima + 4 more

  • Research Article
  • 10.1007/s11748-025-02212-4
Mid-term outcomes and hemodynamic performances of Abbott Epic mitral bioprosthesis: a single-center study.
  • Oct 28, 2025
  • General thoracic and cardiovascular surgery
  • Takayuki Gyoten + 9 more

  • Research Article
  • 10.1007/s11748-025-02214-2
Unilateral versus bilateral antegrade cerebral perfusion during aortic arch surgery: an updated meta-analysis of comparative studies.
  • Oct 27, 2025
  • General thoracic and cardiovascular surgery
  • Adham Ahmed + 9 more

  • Addendum
  • 10.1007/s11748-025-02215-1
Correction: Effect of posterior pericardiotomy on atrial fibrillation in minimally invasive direct coronary artery bypass surgery.
  • Oct 21, 2025
  • General thoracic and cardiovascular surgery
  • Cüneyt Narin + 1 more

  • Research Article
  • 10.1007/s11748-025-02216-0
Left atrial appendage blood flow analysis using four-dimensional flow magnetic resonance imaging.
  • Oct 18, 2025
  • General thoracic and cardiovascular surgery
  • Akihito Ohkawa + 11 more

  • Research Article
  • 10.1007/s11748-025-02207-1
Analysis of prognostic factors after pulmonary resection for metastatic breast cancer: a 23-year single-institution retrospective study.
  • Oct 16, 2025
  • General thoracic and cardiovascular surgery
  • Ryusei Yoshino + 6 more

  • Research Article
  • 10.1007/s11748-025-02213-3
Prognostic significance of postoperative serum C-reactive protein levels after minimally invasive esophagectomy for esophageal cancer.
  • Oct 14, 2025
  • General thoracic and cardiovascular surgery
  • Hirotaka Ishida + 9 more

  • Research Article
  • 10.1007/s11748-025-02208-0
Outcomes of heart transplantation using ECMO-supported donation in brain dead donors.
  • Oct 11, 2025
  • General thoracic and cardiovascular surgery
  • Soojin Lee + 5 more
