Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence (AI) is finding an increasing role in clinical care and medical education. The objective of this study was to evaluate the ability of the large language model ChatGPT to generate exam questions that are discriminating in the evaluation of graduating urology residents. Graduating urology residents representing all Canadian training programs gather yearly for a mock exam that simulates their upcoming board certification exam. The exam consists of a written multiple-choice question (MCQ) component and an oral objective structured clinical examination (OSCE). In 2023, ChatGPT version 4 was used to generate 20 MCQs that were added to the written component. ChatGPT was asked to use Campbell-Walsh Urology and the AUA and CUA guidelines as resources. Psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also reviewed by 3 faculty members to assess face validity and to ascertain whether they were derived from a valid source. The mean score of the 35 exam takers on the ChatGPT MCQs was 60.7%, versus 61.1% for the overall exam. Twenty-five percent of the ChatGPT MCQs showed a discrimination index > 0.3, the threshold for questions that properly discriminate between high and low exam performers. Twenty-five percent of the ChatGPT MCQs showed a point-biserial correlation > 0.2, which is considered a high correlation with overall performance on the exam. The faculty assessment found that ChatGPT MCQs often provided incomplete information in the stem, offered multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided wrong answers to their stems. Despite apparently similar performance on the ChatGPT MCQs and the overall exam, the ChatGPT MCQs tended not to be highly discriminating. Poorly phrased questions and the potential for AI hallucinations remain ever present. Careful vetting of ChatGPT-generated questions for quality should be undertaken before their use in urology training exam assessments.
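The discrimination index and point-biserial correlation cited above are standard item-analysis statistics. The sketch below is a minimal Python illustration, assuming a 0/1 item-response matrix, an upper/lower 27% group split for the discrimination index, and a corrected point-biserial (item excluded from the total score); these implementation details and the simulated data are illustrative assumptions, not taken from the study.

```python
import numpy as np

def item_statistics(responses: np.ndarray, item: int, group_frac: float = 0.27):
    """Discrimination index and point-biserial correlation for one MCQ item.

    responses: array of shape (n_examinees, n_items), 1 = correct, 0 = incorrect.
    item: column index of the item being analysed.
    group_frac: fraction of examinees in the upper/lower groups (27% is a common
                convention, assumed here; the study does not specify its method).
    """
    totals = responses.sum(axis=1)          # total score per examinee
    item_scores = responses[:, item]

    # Discrimination index: proportion correct in the top-scoring group
    # minus the proportion correct in the bottom-scoring group.
    order = np.argsort(totals)
    n_group = max(1, int(round(group_frac * len(totals))))
    low_group = item_scores[order[:n_group]]
    high_group = item_scores[order[-n_group:]]
    discrimination = high_group.mean() - low_group.mean()

    # Point-biserial: Pearson correlation between the 0/1 item score and the
    # total score on the remaining items (corrected variant, assumed here).
    rest_totals = totals - item_scores
    point_biserial = np.corrcoef(item_scores, rest_totals)[0, 1]

    return discrimination, point_biserial

# Example with simulated data: 35 examinees, 20 items (sizes mirror the study,
# but the responses themselves are random).
rng = np.random.default_rng(0)
simulated = (rng.random((35, 20)) > 0.4).astype(int)
d, rpb = item_statistics(simulated, item=0)
print(f"discrimination index = {d:.2f}, point-biserial = {rpb:.2f}")
```

Under the thresholds reported in the abstract, an item would count as adequately discriminating if `d > 0.3` and as highly correlated with overall performance if `rpb > 0.2`.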