Abstract

PURPOSE: To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions. DESIGN: Experimental study. METHODS: This study evaluated three large language models (LLMs) with chat interfaces: Bing Chat (Microsoft) and ChatGPT-3.5 and ChatGPT-4.0 (OpenAI), using 250 questions from the Basic and Clinical Science Course (BCSC) Self-Assessment Program (SAP). Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and patient-care phase, and instances of information fabrication (hallucination) or non-logical reasoning were documented. MAIN OUTCOME MEASURES: The primary outcome was response accuracy; secondary outcomes were performance in question subcategories and hallucination frequency. RESULTS: Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled in workup-type questions (OR = 3.89, 95% CI 1.19-14.73, p = 0.03) compared with diagnostic questions, but struggled with image interpretation (OR = 0.14, 95% CI 0.05-0.33, p < 0.01) compared with single-step reasoning questions. Relative to single-step questions, Bing Chat also had difficulty with image interpretation (OR = 0.18, 95% CI 0.08-0.44, p < 0.01) and multi-step reasoning (OR = 0.30, 95% CI 0.11-0.84, p = 0.02). ChatGPT-3.5 had the highest rate of hallucination or non-logical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%). CONCLUSIONS: LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform comparably to human respondents on questions from the BCSC SAP. The frequency of hallucination and non-logical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.
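The abstract reports subgroup comparisons as odds ratios with 95% confidence intervals but does not state how they were computed. As a minimal illustrative sketch (not the authors' method), an odds ratio with a Wald 95% CI and a Fisher's exact p-value can be derived from a 2x2 table of correct versus incorrect responses per question subcategory; the counts below are hypothetical placeholders, not data from the study.

```python
import numpy as np
from scipy import stats

# Hypothetical counts of correct/incorrect answers in two question
# subcategories (e.g., workup vs. diagnostic). Illustrative only;
# the abstract does not report raw counts.
a, b = 18, 2   # subcategory 1: correct, incorrect
c, d = 40, 18  # subcategory 2: correct, incorrect

# Odds ratio from the 2x2 table
or_hat = (a * d) / (b * c)

# Wald 95% confidence interval on the log odds ratio
se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = stats.norm.ppf(0.975)
ci_low = np.exp(np.log(or_hat) - z * se_log_or)
ci_high = np.exp(np.log(or_hat) + z * se_log_or)

# Two-sided p-value via Fisher's exact test
_, p_value = stats.fisher_exact([[a, b], [c, d]])

print(f"OR = {or_hat:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}, p = {p_value:.3f}")
```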
