Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

Zhi Wei Lim, Krithi Pushpanathan, Samantha Min Er Yew, Yien Lai, Chen-Hsin Sun, Janice Sing Harn Lam, David Ziyou Chen, Jocelyn Hui Lin Goh, Marcus Chun Jin Tan, Bin Sheng, Ching-Yu Cheng, Victor Teck Chang Koh, Yih-Chung Tham

doi:10.1016/j.ebiom.2023.104770

Abstract

Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs' accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a frequent topic about which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries.

We curated thirty-one commonly asked myopia care-related questions, which were categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. Responses rated 'good' were further evaluated for comprehensiveness on a five-point scale. Conversely, for responses rated 'poor', the LLM was prompted to self-correct and the revised response was re-evaluated for accuracy.

ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated as 'good', compared to 61.3% in ChatGPT-3.5 and 54.8% in Google Bard (Pearson's chi-squared test, all p≤0.009). All three LLM-Chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum score of 5). All LLM-Chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 in 3) of ChatGPT-4.0's, 40% (2 in 5) of ChatGPT-3.5's, and 60% (3 in 5) of Google Bard's responses improved after self-correction. The LLM-Chatbots performed consistently across domains, except for 'treatment and prevention'. Even in this domain, ChatGPT-4.0 performed best, receiving 70% 'good' ratings, compared to 40% in ChatGPT-3.5 and 45% in Google Bard (Pearson's chi-squared test, all p≤0.001).

Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continuous strategies and evaluations to improve LLMs' accuracy remain crucial.

Dr Yih-Chung Tham was supported by the National Medical Research Council of Singapore (NMRC/MOH/HCSAINV21nov-0001).
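The majority consensus grading and the headline percentages can be illustrated with a minimal Python sketch. The function names are hypothetical. The abstract does not say how a three-way disagreement between graders was resolved, so this sketch simply returns `None` in that case. The counts 25, 19, and 17 out of 31 used in the usage note are back-calculated from the reported percentages and are not stated in the abstract.

```python
from collections import Counter

def majority_rating(grades):
    """Return the rating chosen by at least two of the three graders.

    Returns None when all three graders disagree; the abstract does not
    specify a tie-breaking rule, so none is assumed here.
    """
    rating, count = Counter(grades).most_common(1)[0]
    return rating if count >= 2 else None

def percent_good(final_ratings):
    """Percentage of final ratings that are 'good', rounded to one decimal."""
    good = sum(1 for r in final_ratings if r == "good")
    return round(100 * good / len(final_ratings), 1)
```

Under these assumptions, `majority_rating(["good", "good", "borderline"])` gives `"good"`, and 25, 19, and 17 'good' ratings out of 31 questions reproduce the reported 80.6%, 61.3%, and 54.8%.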

Journal: eBioMedicine
Publication Date: Aug 23, 2023
Citations: 104

Similar Papers

Assessment of Large Language Models in Cataract Care Information Provision: A Quantitative Comparison.
Zichang Su ... Juan Ye
Ophthalmology and Therapy | VOL. -
08 Nov 2024

The Application of Large Language Models in Gastroenterology: A Review of the Literature.
Marcello Maida ... Daryl Ramai
Cancers | VOL. 16
28 Sep 2024

Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT.
Zachary C Lum
Clinical Orthopaedics & Related Research | VOL. 481
23 May 2023

Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review.
Leyao Wang ... Zhijun Yin
Journal of Medical Internet Research | VOL. 26
07 Nov 2024
