Abstract
Background
Amidst ongoing staff shortages in healthcare systems, patients and their families are increasingly turning to chatbots powered by Large Language Models (LLMs) for information about their medical conditions. These AI-driven chatbots, capable of generating human-like responses across a broad range of topics, have become a prevalent tool in the healthcare landscape. Given their proliferation, it is crucial to evaluate the quality and accuracy of the responses they provide.

Methods
We selected five freely accessible chatbots (Bard, Microsoft Copilot, PiAI, ChatGPT, and ChatSpot) for our study. These chatbots were posed questions spanning three medical fields: cardiology, cardio-oncology, and cardio-rheumatology. The responses generated by the chatbots were then compared against established guidelines from the European Society of Cardiology, the American Academy of Dermatology, and the American Society of Clinical Oncology. In addition to content, the readability of the responses was evaluated using four readability scales: the Flesch Reading Ease Scale, the Gunning Fog Scale Level, the Flesch-Kincaid Grade Level, and the Dale-Chall Score. To assess the accuracy of the responses and their compliance with the medical guidelines, two independent medical professionals rated them on a 3-point Likert scale (0 - incorrect, 1 - partially correct or incomplete, 2 - correct).

Results
We posed a total of 45 questions to each chatbot. Of the five chatbots, Microsoft Copilot, PiAI, and ChatGPT were able to respond to all of the questions. The length of the responses varied, with PiAI providing the shortest average response length (7.26 words) and Bard the longest (18.9 words). Flesch Reading Ease Scale scores ranged from 17.67 (ChatGPT) to 39.34 (Bard), indicating the relative complexity of the responses. The Flesch-Kincaid Grade Level, which reflects the academic grade level required to comprehend the text, ranged from 14.02 (PiAI) to 15.97 (ChatGPT). The Gunning Fog Scale Level varied from 15.77 (Bard) to 19.73 (ChatGPT). The Dale-Chall Score, which assesses the understandability of the text, ranged from 10.24 (Bard) to 11.87 (ChatGPT). These results highlight the variability in the readability and complexity of responses generated by different chatbots. The readability analysis is presented in Table 1.

Conclusion
This study indicates that chatbot responses vary in length, quality, and readability. Each chatbot answers questions in its own way, based on the data it has drawn from the internet. Our findings suggest that people seeking medical information from a chatbot should be cautious and verify the answers they receive.
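To illustrate the readability metrics named in the Methods above, the sketch below computes all four scores for a sample chatbot response using the open-source Python package textstat. This is an assumption for illustration only; the abstract does not state which tool or implementation the authors used, and the sample text is hypothetical.

```python
# Illustrative sketch (not necessarily the authors' pipeline): scoring one
# chatbot response with the four readability metrics used in the study,
# via the open-source `textstat` package (pip install textstat).
import textstat

# Hypothetical chatbot response used only for demonstration.
response = (
    "Anthracycline chemotherapy can be cardiotoxic; baseline echocardiography "
    "and periodic troponin monitoring are recommended in high-risk patients."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(response),
    "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade(response),
    "Gunning Fog Scale Level": textstat.gunning_fog(response),
    "Dale-Chall Score": textstat.dale_chall_readability_score(response),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Higher Flesch Reading Ease values indicate easier text, whereas the other three scales rise with difficulty, which is why the ranges reported in the Results move in opposite directions across chatbots.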