Abstract
Introduction: The text generation abilities of large language models (LLMs), such as ChatGPT, can streamline the research process and potentially expedite literature review.

Methods: We evaluated the capabilities of LLMs, and their major susceptibilities of hallucination and goal misgeneralization, in reviewing cardiology literature. We asked GPT-4, GPT-3.5, and LLaMa 3 30 comprehensive questions (n=600) about findings from notable cardiovascular studies and randomized controlled trials (n=20) published from 2016 to 2020. Each question group followed a fixed format: the first question checked accuracy of identification, the second checked for misgeneralization, and the third checked for hallucination (Table 1). All questions were submitted in separate interfaces for each model to maintain independence and prevent bias. Responses were then reviewed for accuracy, misgeneralizations, and hallucinations.

Results: GPT-4 and GPT-3.5 did not significantly differ in accuracy (55% vs. 40%, p=0.527), with a low association (φ=0.1) and an odds ratio (OR) of 1.83 (95% CI: 0.522-6.434). Similar results were observed between GPT-4 and LLaMa 3 (55% vs. 30%, p=0.201), with an OR of 2.85 (95% CI: 0.776-10.467). For generalization, GPT-4 and GPT-3.5 also did not significantly differ (30% vs. 35%, p=0.999), with a low association (φ=0.001) and an OR of 0.80 (95% CI: 0.211-2.998). Likewise, GPT-4 and LLaMa 3 showed no significant difference (30% vs. 25%, p=0.999), with an OR of 1.29 (95% CI: 0.319-5.174). For non-hallucinations, GPT-4 and GPT-3.5 did not significantly differ (30% vs. 60%, p=0.187), with a low association (φ=0.214) and an OR of 0.33 (95% CI: 0.088-1.256). Lastly, GPT-4 and LLaMa 3 showed no significant difference (30% vs. 25%, p=0.836), with an OR of 1.50 (95% CI: 0.36-6.13). Overall, the models did not differ significantly in accuracy, misgeneralization, or hallucination. Given the moderate susceptibilities of these models, LLMs require further training before they can realistically be used in cardiology.

Conclusions: LLMs show potential for identifying cardiology studies, but moderate rates of hallucination and misgeneralization currently render them unsuitable for literature review. Until these susceptibilities are reduced (<5-10%), LLMs remain impractical for accurately reviewing literature.
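For illustration, each pairwise comparison reported above reduces to a 2×2 contingency analysis. The sketch below is a minimal Python example for the GPT-4 vs. GPT-3.5 accuracy comparison, assuming 20 scored responses per model (so 55% and 40% correspond to 11/20 and 8/20) and standard procedures: Fisher's exact test for the p-value and odds ratio, a φ coefficient derived from the chi-square statistic, and a Wald-style confidence interval on the log odds ratio. The counts and the exact statistical procedures are assumptions for demonstration, not taken from the study's methods.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of correct vs. incorrect responses
# (assumed counts: 55% vs. 40% accuracy over 20 responses each)
table = np.array([[11, 9],    # GPT-4:   11 correct, 9 incorrect
                  [8, 12]])   # GPT-3.5:  8 correct, 12 incorrect

# Fisher's exact test: sample odds ratio and two-sided p-value
odds_ratio, p_value = stats.fisher_exact(table)

# Phi coefficient from the (uncorrected) chi-square statistic: phi = sqrt(chi2 / n)
chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
phi = np.sqrt(chi2 / table.sum())

# Wald-style 95% CI for the odds ratio on the log scale
a, b, c, d = table.ravel()
log_or = np.log((a * d) / (b * c))
se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low, ci_high = np.exp(log_or - 1.96 * se), np.exp(log_or + 1.96 * se)

print(f"OR={odds_ratio:.2f} (95% CI: {ci_low:.3f}-{ci_high:.3f}), "
      f"phi={phi:.3f}, p={p_value:.3f}")
```

With these assumed counts, the odds ratio and confidence interval reproduce the reported 1.83 (0.522-6.434); the φ and p-values depend on the exact test variant chosen, so small discrepancies are expected.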