Abstract

Background
Amidst the backdrop of staff shortages in healthcare systems, patients and their families are increasingly turning to chatbots powered by Large Language Models (LLMs) for information about their medical conditions. These AI-driven chatbots, capable of generating human-like responses across a broad range of topics, have become a prevalent tool in the healthcare landscape. Given their proliferation, it is crucial to evaluate the quality and accuracy of the responses they provide.

Methods
We selected five freely accessible chatbots (Bard, Microsoft Copilot, PiAI, ChatGPT, and ChatSpot) for our study. These chatbots were posed questions spanning three medical fields: cardiology, cardio-oncology, and cardio-rheumatology. The responses generated by the chatbots were compared against established guidelines from the European Society of Cardiology, the American Academy of Dermatology, and the American Society of Clinical Oncology. In addition to content, the readability of the responses was evaluated using four readability scales: the Flesch Reading Ease, the Gunning Fog Index, the Flesch-Kincaid Grade Level, and the Dale-Chall Score. To assess the accuracy of the responses against the medical guidelines, two independent medical professionals rated them on a 3-point Likert scale (0 - incorrect, 1 - partially correct or incomplete, 2 - correct), allowing us to gauge the compliance of the chatbot responses with the guidelines.

Results
We posed a total of 45 questions to each chatbot. Of the five chatbots, Microsoft Copilot, PiAI, and ChatGPT were able to respond to all the questions. Response length varied, with PiAI providing the shortest average response (7.26 words) and Bard the longest (18.9 words). In terms of readability, Flesch Reading Ease scores ranged from 17.67 (ChatGPT) to 39.34 (Bard), indicating the relative complexity of the responses. The Flesch-Kincaid Grade Level, which reflects the academic grade level required to comprehend the text, ranged from 14.02 (PiAI) to 15.97 (ChatGPT). The Gunning Fog Index, another measure of readability, varied from 15.77 (Bard) to 19.73 (ChatGPT). Lastly, the Dale-Chall Score, which assesses the understandability of the text, ranged from 10.24 (Bard) to 11.87 (ChatGPT). These results highlight the variability in the readability and complexity of responses generated by different chatbots. The readability analysis is presented in Table 1.

Conclusion
This study indicates that chatbot responses vary in length, quality, and readability. Each chatbot answers a given question in its own way, based on the data it has drawn from online sources. Our data suggest that users seeking medical information from a chatbot should exercise caution and verify the answers they receive.
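To illustrate the readability evaluation described in the Methods, the minimal sketch below computes the four scores for a sample response using the Python textstat package; the choice of textstat and the sample text are assumptions for demonstration only and are not the tooling or data reported in the study.

    # Minimal sketch of the readability scoring step, assuming the
    # Python "textstat" package (pip install textstat).
    # The sample response text is hypothetical, not taken from the study.
    import textstat

    sample_response = (
        "Patients receiving anthracycline chemotherapy should undergo baseline "
        "echocardiography to assess left ventricular ejection fraction before treatment."
    )

    scores = {
        "Flesch Reading Ease": textstat.flesch_reading_ease(sample_response),
        "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade(sample_response),
        "Gunning Fog Index": textstat.gunning_fog(sample_response),
        "Dale-Chall Score": textstat.dale_chall_readability_score(sample_response),
    }

    for metric, value in scores.items():
        # Higher Flesch Reading Ease means easier text; the other three
        # scales increase with text difficulty.
        print(f"{metric}: {value:.2f}")

In practice, each chatbot response would be scored in this way and the per-chatbot averages compared, as summarized in Table 1.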