Background It is recognised that large language models (LLMs) may aid medical education by supporting the understanding of explanations behind answers to multiple choice questions. This study aimed to evaluate the efficacy of the LLM chatbots ChatGPT and Bard in answering an Intermediate Life Support pre-course multiple choice question (MCQ) test developed by the Resuscitation Council UK, focused on managing deteriorating patients and on identifying the causes of, and treating, cardiac arrest. We assessed the accuracy of the responses and the quality of the explanations to evaluate the utility of the chatbots.

Methods The performance of the AI chatbots ChatGPT-3.5 and Bard was assessed on their ability to choose the correct answer and to provide clear, comprehensive explanations when answering MCQs developed by the Resuscitation Council UK for its Intermediate Life Support Course. Ten MCQs were tested, giving a total score of 40, with one point awarded for each accurate response to each sub-statement a-d. In a separate scoring, a question scored one point only if all sub-statements a-d were answered correctly, giving a total score out of 10 for the test. The explanations provided by the AI chatbots were evaluated by three qualified physicians on a 0-3 rating scale for each overall question, and median rater scores were calculated and compared. The Fleiss multi-rater kappa (κ) was used to determine score agreement among the three raters.

Results When each overall question was scored to give a total out of 10, Bard outperformed ChatGPT, although the difference was not statistically significant (p=0.37). Similarly, there was no statistically significant difference between ChatGPT and Bard when each sub-statement was scored separately to give a total out of 40 (p=0.26). The quality of the explanations was similar for both LLMs. Importantly, even for questions they answered incorrectly, both AI chatbots provided some useful and correct information in their explanations. The Fleiss multi-rater kappa was 0.899 (p<0.001) for ChatGPT and 0.801 (p<0.001) for Bard.

Conclusions Bard and ChatGPT performed similarly in answering the MCQs, achieving similar scores. Notably, despite having access to data from across the web, neither LLM answered all questions accurately. This suggests that AI models still have learning to do before they can reliably support medical education.
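As a minimal sketch of how the inter-rater agreement statistic can be computed, the Python snippet below uses the statsmodels implementation of Fleiss' kappa for three raters scoring ten items on a 0-3 scale. The rating matrix is an illustrative placeholder, not the study's data, and the overall workflow is an assumption rather than the authors' actual analysis code.

```python
# Hypothetical sketch: Fleiss' kappa for three raters scoring 10 explanations on a 0-3 scale.
# The ratings below are illustrative placeholders, NOT the study's data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = 10 MCQ explanations, columns = 3 physician raters, values = quality scores (0-3).
ratings = np.array([
    [3, 3, 3],
    [2, 2, 3],
    [3, 3, 3],
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 2],
    [0, 0, 0],
    [3, 3, 3],
    [2, 2, 2],
    [1, 1, 2],
])

# Convert raw ratings into a subjects-by-categories count table, then compute kappa.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```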