Background: The integration of large language models (LLMs) such as ChatGPT into healthcare has significant implications for patient education and clinical decision-making.

Aims: This systematic review and pooled analysis aims to evaluate the accuracy of ChatGPT-3.5 and ChatGPT-4 in answering simple queries across cardiovascular (CV) medicine disciplines.

Methods: Literature searches were conducted in PubMed, Embase, and Cochrane Central in May 2024, using the keywords “ChatGPT”, “LLMs”, and “chat-based artificial intelligence models”. Cross-sectional, peer-reviewed studies published in 2023 and 2024 that investigated ChatGPT’s performance on CV medicine-related queries (Table/Figure) were included. Queries were evaluated by expert physicians in the corresponding fields within each study (not re-adjudicated by us), and a standardized grading system scoring each answer as "accurate" or "inaccurate" was applied for the pooled analysis.

Results: Of 127 identified and screened peer-reviewed studies, 14 studies comprising 542 CV-related queries were included. Pooled analysis revealed an overall accuracy of 84.5% (458/542; 95% CI [81.5, 87.6]). Stratification by model (ChatGPT-4 vs. ChatGPT-3.5) showed no significant difference in accuracy (p=0.32), nor did accuracy differ significantly between answers given in 2023 and 2024 (p=0.07). Accuracies across topics were statistically comparable, except in cardio-oncology, where accuracy was significantly lower at 68% (p=0.02). Detailed per-topic performance is reported in the table and figure.

Conclusion: ChatGPT demonstrated consistently high accuracy in answering CV-related queries, with no significant differences across model versions or years. These results support the potential use of online chat-based LLMs as an informational tool in cardiology.
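The abstract does not state which interval method underlies the reported 95% CI for the pooled accuracy. As a minimal sketch (an assumption, not the authors' stated method), a normal-approximation (Wald) interval for the binomial proportion 458/542 closely reproduces the reported bounds:

```python
from math import sqrt

def wald_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) 95% CI for a binomial proportion.

    Note: this is an illustrative assumption; the review does not
    specify how its confidence interval was computed.
    """
    p = successes / n
    se = sqrt(p * (1 - p) / n)  # standard error of the sample proportion
    return p, p - z * se, p + z * se

# Pooled accuracy from the abstract: 458 accurate answers out of 542 queries
p, lo, hi = wald_ci(458, 542)
print(f"accuracy = {p:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

This yields roughly 84.5% with an interval near [81.5%, 87.5%], consistent (up to rounding) with the abstract's reported 95% CI [81.5, 87.6]; a Wilson or Clopper-Pearson interval would give slightly different bounds.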