Exploring LLM-based chatbot effectiveness in answering questions related to the risks and benefits of orthognathic treatment: a cross-sectional study

Abstract

Objective: To assess the accuracy, reliability, quality, and readability of responses generated by three large language model chatbots, ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.5 Flash, when answering common patient questions about the risks and benefits of orthognathic treatment. Materials and Methods: Twenty frequently searched questions were identified via Google and entered into each chatbot. Responses were evaluated using validated scoring systems for accuracy, modified DISCERN, global quality scale (GQS), and Flesch Reading Ease. Intra- and inter-rater reliability was assessed using Cohen’s kappa and intra-class correlation coefficients. Non-parametric tests were applied due to non-normal data distribution. Results: Copilot achieved the highest reliability and quality scores, with significant differences observed in modified DISCERN (P < 0.001) and GQS (P = 0.046). Post hoc tests confirmed Copilot significantly outperformed ChatGPT. Accuracy scores did not differ significantly (P = 0.704). Readability varied significantly, with Gemini and ChatGPT producing more accessible responses than Copilot. Intra- and inter-rater reliability scores were substantial to excellent for categorical measures and excellent for readability. Conclusions: Copilot provided the most reliable and highest-quality responses, whilst ChatGPT and Gemini offered greater readability. Despite these strengths, variability in accuracy and reliability highlights the need for caution. Chatbots should be considered as supplementary tools, and patients should verify information with qualified professionals.
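
For reference, the Flesch Reading Ease score used in the readability comparison follows a fixed formula based on average sentence length and average syllables per word. The sketch below is illustrative only: the function names and the vowel-group syllable heuristic are assumptions, not the tool the authors used, and dedicated readability software counts syllables more carefully.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def score_text(text: str) -> float:
    """Split text into sentences and words, then apply the formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)
```

Short sentences built from short words push the score up, which is why a chatbot can score well on readability while scoring lower on reliability or quality.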

Similar Papers
  • Research Article
  • Cite Count: 183
  • 10.1597/14-027
The Americleft Speech Project: A Training and Reliability Study.
  • Jan 1, 2016
  • The Cleft palate-craniofacial journal : official publication of the American Cleft Palate-Craniofacial Association
  • Kathy L Chapman + 11 more

To describe the results of two reliability studies and to assess the effect of training on interrater reliability scores. The first study (1) examined interrater and intrarater reliability scores (weighted and unweighted kappas) and (2) compared interrater reliability scores before and after training on the use of the Cleft Audit Protocol for Speech-Augmented (CAPS-A) with British English-speaking children. The second study examined interrater and intrarater reliability on a modified version of the CAPS-A (CAPS-A Americleft Modification) with American and Canadian English-speaking children. Finally, comparisons were made between the interrater and intrarater reliability scores obtained for Study 1 and Study 2. The participants were speech-language pathologists from the Americleft Speech Project. In Study 1, interrater reliability scores improved for 6 of the 13 parameters following training on the CAPS-A protocol. Comparison of the reliability results for the two studies indicated lower scores for Study 2 compared with Study 1. However, this appeared to be an artifact of the kappa statistic that occurred due to insufficient variability in the reliability samples for Study 2. When percent agreement scores were also calculated, the ratings appeared similar across Study 1 and Study 2. The findings of this study suggested that improvements in interrater reliability could be obtained following a program of systematic training. However, improvements were not uniform across all parameters. Acceptable levels of reliability were achieved for those parameters most important for evaluation of velopharyngeal function.
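
The kappa artifact described above can be reproduced with a small sketch. The ratings below are invented for illustration and are not data from the study; the function name is likewise an assumption. Both rater pairs agree on 90% of items, yet kappa collapses when almost every item falls in one category.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters; returns (kappa, percent_agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected), observed


# Hypothetical balanced sample: 9 joint "0", 9 joint "1", 2 disagreements.
balanced_a = [0] * 9 + [1] * 9 + [0, 1]
balanced_b = [0] * 9 + [1] * 9 + [1, 0]

# Hypothetical low-variability sample: 18 joint "0", 2 disagreements.
skewed_a = [0] * 18 + [1, 0]
skewed_b = [0] * 18 + [0, 1]

print(cohen_kappa(balanced_a, balanced_b))
print(cohen_kappa(skewed_a, skewed_b))
```

In the skewed sample, chance agreement is already about 0.905, so 90% observed agreement yields a kappa slightly below zero. Reporting percent agreement alongside kappa, as the authors did, makes this effect visible.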

  • Research Article
  • 10.3389/froh.2026.1813936
Feasibility and exploratory assessment of large language models for pediatric dentistry queries: a comparative study.
  • Apr 24, 2026
  • Frontiers in oral health
  • Sanjeev B Khanagar + 7 more

Large Language Models (LLMs) are increasingly used by caregivers to obtain pediatric health information. However, concerns persist regarding the accuracy, reliability, and readability of AI-generated content, especially in pediatric dentistry, where caregiver comprehension is crucial. To conduct an exploratory feasibility assessment of evaluating accuracy, quality, reliability, and readability of responses generated by ChatGPT-4, Google Gemini, and DeepSeek to common pediatric dentistry queries. This exploratory comparative cross-sectional feasibility study utilized 15 patient-oriented pediatric dentistry questions identified through structured searches and expert screening. Each question was submitted verbatim to ChatGPT-4, Gemini, and DeepSeek under standardized conditions. Responses were independently evaluated by three calibrated pediatric dentistry experts using the Global Quality Scale (GQS), a modified DISCERN tool, and the Accuracy of Information Index (AOI). Readability was assessed using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL). Inter-examiner reliability was assessed using intraclass correlation coefficients (ICC). Statistical comparisons between LLMs were performed using a fixed-effects model with post-hoc pairwise analysis. Inter-examiner agreement was further evaluated using Bland-Altman analysis. A p-value of <0.05 was considered statistically significant. Overall scoring was consistent across examiners, with minor variability observed across domains. A linear mixed-effects model conducted separately for each domain demonstrated that LLM type significantly influenced GQS scores (F = 7.90, p = 0.00), with Gemini and DeepSeek outperforming ChatGPT. No significant differences were observed for AOI (p = 0.44) and DISCERN (p = 0.06). 
Bland-Altman analysis indicated minimal inter-examiner bias; however, the limits of agreement were relatively wide considering the scale range, reflecting variability between individual ratings. Single-measure ICC demonstrated poor agreement (ICC = 0.26), while higher reliability was observed when scores were averaged (ICC = 0.90). This study offers an exploratory feasibility assessment of LLM evaluation in pediatric dentistry. While the models generally produced high-quality outputs, variations in accuracy, readability, and significant inter-examiner variability highlight important methodological challenges. These findings represent preliminary groundwork and require validation in larger, clinically diverse, real-world settings. LLMs may serve as supportive informational tools; however, their outputs should be interpreted cautiously and used to complement, not replace, professional clinical judgment.

  • Conference Article
  • 10.11159/icsta24.108
Evaluation of Biostatistics Contents in ChatGPT: A Descriptive Study
  • Aug 1, 2024
  • Arzu Baygül Eden + 2 more

This study aims to evaluate the reliability and quality of ChatGPT within the context of biostatistics. The findings will enlighten researchers and clinicians about the advantages and limitations of employing ChatGPT for biostatistical information. It is important to note that this study does not extensively assess advanced biostatistical methods but rather focuses on the question: "Can researchers/clinicians dependably and effortlessly use ChatGPT?" ChatGPT was presented with Frequently Asked Questions (FAQ) in biostatistics, and responses to 20 questions were blindly evaluated by three biostatisticians holding PhDs for reliability and quality. Ratings were based on a reliability score (1 to 7), Global Quality Scale (GQS) (1 to 5), Flesch Reading Ease Score (FRES), and the Intraclass Correlation Coefficient (ICC). Moderate ICC values were observed between raters for reliability (0.646) and GQS (0.545), with a significant correlation between the reliability score and GQS (r = 0.708; p < 0.001). While ChatGPT provided reliable, high-quality content in response to biostatistics FAQs, it is noted that it cannot replace biostatistics experts. The readability of the content was generally challenging (FRES score: 17.21 ± 2.04). ChatGPT shows promise as a supplementary tool for accessing biostatistics information but should be used alongside human expertise. Future research could explore ways to enhance its readability and compare its performance with alternative sources.

  • Research Article
  • 10.3389/fmed.2026.1752664
Artificial intelligence vs. human evaluation of anesthesia education videos: a comparative analysis using validated quality scales.
  • Jan 1, 2026
  • Frontiers in medicine
  • Kubra Taskin + 1 more

YouTube has become an increasingly popular platform for medical education, yet the accuracy and educational quality of anesthesia-related videos remain uncertain. While human experts have traditionally assessed video quality using validated scales such as DISCERN, JAMA, and the Global Quality Scale (GQS), artificial intelligence (AI) models-particularly large language models (LLMs)-now offer new possibilities for scalable, objective content evaluation. This study aimed to compare the educational quality of anesthesia education videos produced by humans and AI, and to examine the level of agreement between human expert ratings and ChatGPT-5 evaluations. In this cross-sectional analytical study, forty YouTube videos were analyzed: 20 produced by human educators and 20 generated using AI tools. Each video was independently assessed by two anesthesiologists and by ChatGPT-5 Plus (OpenAI, 2025) using DISCERN, JAMA, and GQS criteria. Inter-rater reliability between human evaluators was determined using the Intraclass Correlation Coefficient (ICC), and correlations between human and AI ratings were analyzed with Spearman's rho. Human-generated videos scored significantly higher than AI-generated ones in DISCERN (68.45 ± 4.60 vs. 62.77 ± 7.32, p = 0.0044, Cohen's d = 0.82) and JAMA (3.70 ± 0.41 vs. 3.23 ± 0.77, p = 0.0446, Cohen's d = 0.71) scores, whereas no significant difference was observed in GQS scores (p = 0.3033). Inter-rater reliability between human experts was excellent (ICC = 0.81-0.86, p < 0.001). Strong correlations were found between ChatGPT-5 and the human mean scores for all scales (ρ = 0.897 for DISCERN, ρ = 0.785 for GQS, ρ = 0.765 for JAMA; p < 0.001), indicating high agreement between AI and human evaluations. AI-based models such as ChatGPT-5 show potential to approximate human expert judgment in evaluating educational content. 
While human-generated videos remain superior in terms of source transparency and ethical reporting, AI-generated content approaches human quality in structural organization and linguistic fluency. These findings suggest that AI-assisted evaluation systems may serve as standardized, efficient tools for quality screening of large-scale educational video archives in medical education.

  • Research Article
  • Cite Count: 3
  • 10.3390/healthcare13212670
Balancing Accuracy and Readability: Comparative Evaluation of AI Chatbots for Patient Education on Rotator Cuff Tears
  • Oct 23, 2025
  • Healthcare
  • Ali Can Koluman + 4 more

Background/Objectives: Rotator cuff (RC) tears are a leading cause of shoulder pain and disability. Artificial intelligence (AI)-based chatbots are increasingly applied in healthcare for diagnostic support and patient education, but the reliability, quality, and readability of their outputs remain uncertain. International guidelines (AMA, NIH, European health communication frameworks) recommend that patient materials be written at a 6th–8th grade reading level, yet most online and AI-generated content exceeds this threshold. Methods: We compared responses from three AI chatbots—ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google), and DeepSeek-V3 (Deepseek AI)—to 20 frequently asked patient questions about RC tears. Four orthopedic surgeons independently rated reliability and usefulness (7-point Likert) and overall quality (5-point Global Quality Scale). Readability was assessed using six validated indices. Statistical analysis included Kruskal–Wallis and ANOVA with Bonferroni correction; inter-rater agreement was measured using intraclass correlation coefficients (ICCs). Results: Inter-rater reliability was good to excellent (ICC 0.726–0.900). Gemini 1.5 Flash achieved the highest reliability and quality, ChatGPT-4o performed comparably but slightly lower in diagnostic content, and DeepSeek-V3 consistently scored lowest in reliability and quality but produced the most readable text (FKGL ≈ 6.5, within the 6th–8th grade target). None of the models reached a Flesch Reading Ease (FRE) score above 60, indicating that even the most readable outputs remained more complex than plain-language standards. Conclusions: Gemini 1.5 Flash and ChatGPT-4o generated more accurate and higher-quality responses, whereas DeepSeek-V3 provided more accessible content. No single model fully balanced accuracy and readability. 
Clinical Implications: Hybrid use of AI platforms—leveraging high-accuracy models alongside more readable outputs, with clinician oversight—may optimize patient education by ensuring both accuracy and accessibility. Future work should assess real-world comprehension and address the legal, ethical, and generalizability challenges of AI-driven patient education.

  • Research Article
  • Cite Count: 1
  • 10.1038/s41598-025-28857-y
Exploring artificial intelligence chatbots in pediatric fluoride education: a cross-sectional study
  • Nov 29, 2025
  • Scientific Reports
  • Nevra Karamüftüoğlu + 2 more

Large language model-based (LLM) chatbots are increasingly integrated into healthcare communication, offering accessible and interactive information. These artificial intelligence (AI) tools have the potential to influence caregiver health behaviors when tailored to user needs and literacy levels. In pediatric dentistry, fluoride remains a cornerstone of caries prevention but is also subject to public concerns and online misinformation, underscoring the need for reliable digital communication. This observational and exploratory study evaluated the performance of three advanced AI chatbots—ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3—in providing fluoride-related information to parents and caregivers in the context of pediatric oral health. Twenty fluoride-related questions, derived from American Academy of Pediatric Dentistry (AAPD) guideline themes, were presented to each chatbot in standardized sessions. Responses were independently evaluated by three blinded reviewers using validated tools: EQIP, DISCERN, Global Quality Scale (GQS), Flesch Reading Ease Score (FRES), Flesch-Kincaid Reading Grade Level (FKRGL), and iThenticate similarity index. These instruments assessed quality, reliability, readability, and originality. Inter-rater reliability was confirmed with intraclass correlation coefficients (ICCs). Statistical analyses were conducted using ANOVA or Kruskal–Wallis tests with appropriate post-hoc methods. ChatGPT-4.o achieved significantly higher EQIP (M = 4.32, SD = 0.43) and DISCERN (M = 4.20, SD = 0.48) scores than Gemini Pro and DeepSeek V3 (p < 0.001), indicating superior reliability and informational quality. While FRES (median = 68.5, p = 0.12) and Similarity Index (≤ 10%, p = 0.54) showed no significant differences, ChatGPT consistently produced more readable and original content. FKRGL differences were borderline (p = 0.041) but not retained after correction, and GQS outcomes were comparable. 
These findings suggest that ChatGPT’s superior performance is not only statistically significant but also practically relevant for enhancing parental comprehension of fluoride use. Among the evaluated models, ChatGPT-4.o demonstrated the clearest and most reliable fluoride communication. Its higher EQIP and DISCERN scores highlight its potential as a supportive tool for caregiver education in pediatric dentistry. Nonetheless, these systems should be implemented cautiously, complemented with professional oversight, and continuously validated to prevent misinformation and ensure safe clinical integration.

  • Research Article
  • Cite Count: 19
  • 10.1016/j.ijmedinf.2025.105948
Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment.
  • Sep 1, 2025
  • International journal of medical informatics
  • Mine Büker + 1 more

  • Research Article
  • 10.1016/j.jadohealth.2025.09.015
Evaluation of the Quality, Reliability, and Readability of ChatGPT-4 Responses on Exercise and Rehabilitation Strategies for Adolescent Myositis.
  • Jan 1, 2026
  • The Journal of adolescent health : official publication of the Society for Adolescent Medicine
  • Fulden Sari + 1 more

  • Research Article
  • 10.1186/s12903-026-07884-9
Assessing the accuracy, reliability, quality, and readability of artificial intelligence chatbots in patient education: insights from zirconia crowns.
  • Feb 12, 2026
  • BMC oral health
  • Nihan Kaya Acar + 3 more

This study aims to conduct a comparative evaluation of the accuracy, reliability, quality, and readability of chatbot-generated responses from five widely used artificial intelligence (AI) chatbots when addressing frequently asked questions (FAQs) on zirconia and pediatric zirconia crowns. Twenty FAQs on zirconia crowns were derived from two Google searches (“frequently asked questions about zirconia” and “frequently asked questions about pediatric zirconia”). Five chatbots (ChatGPT-5, ChatGPT-4o, Gemini-2.5 Flash, DeepSeek-V3, and Microsoft Copilot) were queried independently, and responses were anonymized and evaluated. Accuracy was rated on a 5-point Likert scale, reliability using a modified DISCERN tool, quality with the Global Quality Scale (GQS), and readability using the Flesch Reading Ease Score (FRES). Statistical analyses included Mann−Whitney U and Kruskal−Wallis tests, with intraclass correlation coefficients (ICC) used for inter-rater reliability. Inter-rater agreement was strong (ICC: 0.78 − 0.98). Gemini achieved the highest scores in accuracy, quality, and reliability (p < 0.001), while ChatGPT-4o, ChatGPT-5, and DeepSeek demonstrated superior readability. Microsoft Copilot scored lowest across domains, particularly in reliability and readability. No significant differences emerged between prosthodontic and pediatric evaluations, except for higher GQS ratings for DeepSeek in pediatric dentistry (p = 0.035). Gemini showed the highest accuracy, reliability, and quality, indicating its strong potential for clinician use in generating evidence-aligned information. ChatGPT-4o, ChatGPT-5, and DeepSeek offered more readable outputs suitable for explanations. Given the substantial between-platform variability, clinicians should critically appraise and, when necessary, adapt chatbot responses to ensure alignment with current evidence before recommending them to patients.

  • Research Article
  • Cite Count: 12
  • 10.1089/fpsam.2023.0368
Chatbots as Patient Education Resources for Aesthetic Facial Plastic Surgery: Evaluation of ChatGPT and Google Bard Responses.
  • Jul 1, 2024
  • Facial plastic surgery & aesthetic medicine
  • Neha Garg + 8 more

Background: ChatGPT and Google Bard™ are popular artificial intelligence chatbots with utility for patients, including those undergoing aesthetic facial plastic surgery. Objective: To compare the accuracy and readability of chatbot-generated responses to patient education questions regarding aesthetic facial plastic surgery using a response accuracy scale and readability testing. Method: ChatGPT and Google Bard™ were asked 28 identical questions using four prompts: none, patient friendly, eighth-grade level, and references. Accuracy was assessed using Global Quality Scale (range: 1-5). Flesch-Kincaid grade level was calculated, and chatbot-provided references were analyzed for veracity. Results: Although 59.8% of responses were good quality (Global Quality Scale ≥4), ChatGPT generated more accurate responses than Google Bard™ on patient-friendly prompting (p < 0.001). Google Bard™ responses were of a significantly lower grade level than ChatGPT for all prompts (p < 0.05). Despite eighth-grade prompting, response grade level for both chatbots was high: ChatGPT (10.5 ± 1.8) and Google Bard™ (9.6 ± 1.3). Prompting for references yielded 108/108 of chatbot-generated references. Forty-one (38.0%) citations were legitimate. Twenty (18.5%) provided accurately reported information from the reference. Conclusion: Although ChatGPT produced more accurate responses and at a higher education level than Google Bard™, both chatbots provided responses above recommended grade levels for patients and failed to provide accurate references.

  • Research Article
  • 10.47141/geriatrik.1680977
Evaluation of YouTube Exercise Videos for Fall Prevention in Older Adults: ChatGPT-4.5 Versus Human Experts
  • Dec 31, 2025
  • Geriatrik Bilimler Dergisi
  • Uğur Sözlü + 4 more

Objective: This study aims to determine the potential and limitations of artificial intelligence (AI) in this field by comparing the results of ChatGPT-4.5 and experts in the evaluation of YouTube videos intended for older adults. Materials and Methods: A search was conducted on YouTube using the keyword “fall prevention exercises for elderly,” and the 100 most viewed videos were examined. Of these, 64 videos that met the criteria were included in the study. The comprehensiveness, quality [global quality scale (GQS)], and reliability [Quality Criteria for Consumer Health Information (DISCERN)] of the videos were evaluated by two independent physiotherapists and ChatGPT-4.5. Agreement between the evaluations was tested using Wilcoxon signed-rank test, intraclass correlation coefficient (ICC), and Bland-Altman analyses. Results: No significant differences were found between ChatGPT-4.5 and human experts in terms of comprehensiveness (p=0.242) and GQS (p=0.083) scores, and a high level of agreement was observed (ICC 0.932 and 0.876, respectively). However, in DISCERN scores, ChatGPT-4.5 awarded significantly higher scores than the human experts (p=0.005), and the level of agreement was determined to be excellent (ICC=0.942). Nevertheless, a wide range of differences (limits of agreement: -4.9 to 7.18) was identified. Conclusion: ChatGPT-4.5 can be used as a reliable assessment tool in determining the comprehensiveness and quality levels of fall prevention exercise videos. However, it was concluded that in reliability scoring, AI should be used under expert supervision.

  • Research Article
  • Cite Count: 8
  • 10.3389/fneur.2019.00540
Inter-rater and Intra-rater Reliability of the Chinese Version of the Action Research Arm Test in People With Stroke
  • May 29, 2019
  • Frontiers in Neurology
  • Jiang-Li Zhao + 6 more

Purpose: To detect the inter-rater and intra-rater reliability of the Chinese version of the Action Research Arm Test (C-ARAT) in patients recovering from a first stroke. Methods: Fifty-five participants (45 men and 10 women) with a mean age of 58.67 ± 12.45 (range: 22–80) years and a mean post-stroke interval of 6.47 ± 12.00 (0.5–80) months were enrolled in this study. To determine the inter-rater reliability, the C-ARAT was administered to each participant by two raters (A and B) with varying levels of experience within 1 day. To determine intra-rater reliability, rater A re-administered the C-ARAT to 33 of the 55 participants on the second day. Intra-class correlation coefficients (ICCs) and Bland–Altman plots were used to analyse the inter-rater and intra-rater reliability. Results: Regarding inter-rater reliability, the total, grasping, gripping, pinching, and gross movement scores received respective ICCs of 0.998, 0.997, 0.995, 0.997, and 0.960 (all p < 0.001), indicating excellent inter-rater reliability in stroke patients. Regarding intra-rater reliability, the corresponding ICCs were 0.987, 0.980, 0.975, 0.944, and 0.954 (all p < 0.001), again indicating excellent intra-rater reliability. The Bland–Altman plots yielded a mean difference of 0.15 with 95% limits of agreement (95%LOA) ranging from −2.16 to 2.46 for the inter-rater measurements and a mean difference of −1.06 with 95%LOA ranging from −6.43 to 4.31 for the intra-rater measurement. The C-ARAT thus appeared to be a stable scoring method. Conclusions: The C-ARAT yielded excellent intra-rater and inter-rater reliability for evaluating the paretic upper extremities of stroke patients. Therefore, our results supported the use of the C-ARAT in this population.
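
As a rough illustration of how Bland–Altman limits of agreement like those above are obtained, the sketch below computes the mean difference (bias) and the conventional 95% limits as bias ± 1.96 × the standard deviation of the paired differences. The function name and the sample scores are hypothetical, and the use of the sample standard deviation is an assumption about the usual convention, not a detail confirmed by the abstract.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return (bias, lower_limit, upper_limit) for two paired sets of scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired scores of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Hypothetical total scores from two raters on four participants.
rater_a = [10, 12, 14, 16]
rater_b = [9, 13, 13, 17]
bias, lower, upper = bland_altman(rater_a, rater_b)
print(f"bias={bias:.2f}, 95% LoA=({lower:.2f}, {upper:.2f})")
```

Limits built this way are symmetric about the bias, which matches the reported values: −2.16 and 2.46 both sit 2.31 from 0.15, and −6.43 and 4.31 both sit 5.37 from −1.06.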

  • Research Article
  • 10.1111/1750-3841.71001
Investigating the Readability and Quality of AI Systems to Trending Questions About Food Poisoning.
  • Apr 1, 2026
  • Journal of food science
  • Idris Demirsoy + 1 more

Consumers increasingly turn to artificial intelligence (AI) systems, including search engines and large language models (LLMs), for immediate food safety guidance. However, the reliability and accessibility of this information for critical public health issues, such as food poisoning, remain unassessed. This study benchmarks the performance of major AI systems: Google, ChatGPT, DeepSeek, and Mistral, by simultaneously evaluating the readability and information quality of their responses to frequently asked questions on food poisoning. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Gunning-Fog Index (GFI) indices. Information quality was evaluated by independent experts using the validated DISCERN instrument and Global Quality Scale (GQS). Our analysis revealed a critical divergence in platform performance. Google produced the most readable text (FKGL: 9.05) but the lowest quality information (DISCERN: 30-34; GQS: only 3% of ratings were top-score). Conversely, LLMs provided high-quality information (DeepSeek DISCERN: 70-75; ChatGPT: 62) but at significantly higher reading levels (FKGL: 10.01-11.32), exceeding the recommended sixth-grade level. This demonstrates a fundamental trade-off: search engines optimize for brevity and accessibility, whereas dedicated LLMs prioritize comprehensive, reliable content. This forces consumers to choose between understandable but potentially misleading information and accurate but inaccessible guidance. Our findings highlight an urgent need to bridge this gap between readability and quality, calling for the development of AI systems that deliver authoritative, comprehensible food safety advice to protect public health.

  • Research Article
  • 10.1177/10538127261433272
Can AI chatbots guide patients and physicians about neck pain? A reliability and readability comparison of ChatGPT-4 and Gemini.
  • Mar 17, 2026
  • Journal of back and musculoskeletal rehabilitation
  • Dicle Rotinda Ozdas Sevgin + 3 more

Background: Artificial intelligence (AI)-based chatbots are increasingly used as sources of medical information. Given the high prevalence of neck pain as a musculoskeletal symptom, patients may commonly consult such tools for health-related guidance. Objective: To evaluate and compare the performance of ChatGPT 4.0 and Google Gemini in addressing commonly asked patient questions and clinical case scenarios related to neck pain, focusing on their accuracy, quality, understandability, readability, reliability, and usability. Methods: Twenty-four patient-oriented questions and four clinical case scenarios regarding neck pain were submitted to ChatGPT 4.0 and Google Gemini. Responses were evaluated using validated tools: modified DISCERN (mDISCERN) for reliability, Global Quality Scale (GQS) for quality, PEMAT-P for understandability and actionability, and Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) for readability. Case-based responses were assessed for accuracy, safety, and usability on a 7-point Likert scale by two experienced physicians. Results: Gemini demonstrated significantly higher reliability (mDISCERN, p < 0.001), whereas ChatGPT 4.0 had slightly higher, though statistically insignificant, GQS and PEMAT-P scores. Readability metrics were similar: ChatGPT's FRE was 48.78 and FKGL 9.08; Gemini's FRE was 47.12 and FKGL 9.11. Both models' outputs were considered difficult to read. In clinical scenarios, both chatbots showed comparable accuracy, safety, and usability, with minor omissions noted. Conclusion: ChatGPT 4.0 and Google Gemini provided similar performance in addressing neck pain-related queries. While both may support patient.

  • Research Article
  • Cite Count: 1
  • 10.7759/cureus.45885
Assessment of the Quality and Reliability of Information on YouTube Regarding Angiography and Angioplasty.
  • Sep 25, 2023
  • Cureus
  • Tanya Nazar + 5 more

Introduction: Angiography is a method for defining the inner vessel wall and demonstrating flow through the lumen by detecting contrast injection into a blood vessel and projecting it onto a sequence of X-rays. This method is used to image the anatomical and architectural aspects of the vascular system. By employing balloon dilatation and the implantation of stents to widen the stenosed arteries, angioplasty is a form of minimally invasive endovascular treatment used to treat cardiovascular diseases and their consequences. People frequently rely on YouTube as a resource for awareness-raising and marketing activities. Animations and visual explanations can help patients understand the risks and benefits of procedures. Aims: To assess the quality and reliability of the information on YouTube about angiography and angioplasty. We assessed quality using the GQS (Global Quality Scale) and reliability via the reliability score. Methodology: This is an observational, cross-sectional study that did not require ethics committee approval. It includes a questionnaire with predetermined criteria like time since upload, popularity, or type of uploader. The study assesses YouTube videos that met the inclusion criteria using GQS and reliability scores. Responses recorded in Google Sheets were transferred to Microsoft Excel (Redmond, USA). Statistical analysis was performed using IBM SPSS Statistics for Windows, Version 21.0 (IBM Corp., Armonk, NY; released 2012). All six authors assessed 10 YouTube videos using specific keywords. Videos that met the inclusion criteria were included; those that did not were excluded. Results: After applying inclusion/exclusion criteria, 57 out of 60 videos were included. Of the total videos analyzed, the majority were uploaded by various hospitals and people other than doctors and healthcare organizations. 
About 78.95% of the videos described the reason for angiography/plasty, followed by the anatomical area involved and the pre-procedural preparation phase. There is a significant increase in the GQS score and reliability score among the videos uploaded by doctors, hospitals, healthcare organizations, and other groups. Conclusions: Verified health information should be uploaded responsibly by doctors, hospitals, healthcare organizations, or other agencies on social media like YouTube in a manner that is easy to understand, has a high GQS, and has a high reliability score, as it would make it simpler for the general population or viewers to have access to important health-related content they can rely on. Videos should advise the viewers to contact their doctors for all queries regarding the diagnosis or treatment of their health concerns.
