Cataracts are a significant cause of blindness. Although individuals frequently turn to the Internet for medical advice, distinguishing reliable information can be challenging. Large language models (LLMs) have attracted attention for generating accurate, human-like responses that may be suitable for medical consultation. However, a comprehensive assessment of LLMs' accuracy within specific medical domains is still lacking. We compiled 46 commonly asked questions related to cataract care, categorized into six domains. Each question was presented to the LLMs, and three consultant-level ophthalmologists independently rated the accuracy of each response on a three-point scale (poor, borderline, good) and its comprehensiveness on a five-point scale. A majority-consensus approach determined the final rating for each response. For responses rated 'Poor', the models were prompted to self-correct and the revised responses were reassessed. For accuracy, ChatGPT-4o and Google Bard both achieved average sum scores of 8.7 (out of 9), followed by ChatGPT-3.5, Bing Chat, Llama 2, and Wenxin Yiyan. In the consensus-based ratings, ChatGPT-4o received more 'Good' ratings than Google Bard. For completeness, ChatGPT-4o achieved the highest average sum score of 13.22 (out of 15), followed by Google Bard, ChatGPT-3.5, Llama 2, Bing Chat, and Wenxin Yiyan. Domain-level results revealed further differences in model capabilities: in the 'Prevention' domain, all models except Wenxin Yiyan were rated 'Good'. All models improved upon self-correction: Google Bard and Bing Chat each improved their single 'Poor' response (1/1), Llama 2 improved 3 of 4, and Wenxin Yiyan improved 4 of 5. Our findings highlight the potential of LLMs, particularly ChatGPT-4o, to deliver accurate and comprehensive responses to cataract-related queries, especially regarding prevention, suggesting their promise for medical consultation. Continued strategies and evaluations to enhance LLMs' accuracy remain essential.