Purpose: There is increasing interest in understanding potential bias in medical education. Prior studies have demonstrated potential disparities in the words used to describe medical students in their evaluations, such as variations by gender or race.1,2 For example, natural language processing (NLP) has been used to identify narrative differences between medical clerkship evaluations on the basis of gender and under-represented minority status.2 Female students were more commonly described using words for personal attributes, whereas male students were more likely to be described using words related to professional competencies.1 In this study, we used NLP to evaluate potential bias in third-year clinical clerkship evaluations.

Method: This study was conducted at the University of California San Diego School of Medicine. Data were extracted from medical evaluation and administrative databases for medical students enrolled in third-year clinical clerkship rotations across 2 academic years (2019–2020 and 2020–2021). For each evaluation, we collected demographic information for both the student and the faculty evaluator and used it to determine gender and racial concordance (i.e., whether the student and faculty evaluator identified with the same group). We extracted the full narrative text of each evaluation as well as numerical evaluation scores. Narrative text was processed with a Python NLP package, which assigned a sentiment score to each evaluation. Word clouds were generated for the demographic groups to display the most commonly used words. We compared the distributions of sentiment scores and clerkship grades to identify any differences by gender or race/ethnicity. We used multinomial logistic regression to model final clerkship grades, with predictors including numerical evaluation scores, gender and racial concordance between faculty and student, and sentiment scores. Statistical analyses were performed in R, with statistical significance defined as P < .05.

Results: We analyzed 963 evaluations from 198 students, 109 (55%) of whom were female. Ninety-two (47%) identified as Caucasian, 77 (39%) Asian, 13 (7%) African American, and 16 (9%) other/unspecified or declined to state race. Ten (5%) identified as Hispanic. Females (190/534, 34%) were more likely than males (113/420, 26%) to receive honors (P = .02). Significantly more students received honors or near-honors grades in the 2020–2021 academic year than in the 2019–2020 academic year (P < .00001). Sentiment scores did not vary significantly by student gender, race, or ethnicity (P = .88, .64, and .06, respectively). Word choices were similar across faculty and student demographic groups. Similarly, in the multinomial logistic regression model of final clerkship grades, the narrative evaluation sentiment score was not predictive of an honors grade (odds ratio [OR] 1.23, P = .58). However, the numerical evaluation average (OR 1.45, P < .001) and gender concordance between faculty and student (OR 1.32, P = .049) were significant predictors of receiving honors.
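The abstract does not name the Python NLP package used for sentiment scoring. The sketch below assumes the VADER analyzer from NLTK purely as a stand-in, and the score_evaluations helper and column names are hypothetical, shown only to illustrate the kind of per-evaluation sentiment score described in the Method section (the grade model itself was fit in R).

```python
# Minimal sketch of per-evaluation sentiment scoring, assuming VADER via NLTK.
# The actual package used in the study is not named in the abstract; the column
# names ("narrative_text", "sentiment_score") are hypothetical.
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-time setup: import nltk; nltk.download("vader_lexicon")

def score_evaluations(df: pd.DataFrame, text_col: str = "narrative_text") -> pd.DataFrame:
    """Append a compound sentiment score in [-1, 1] for each narrative evaluation."""
    analyzer = SentimentIntensityAnalyzer()
    out = df.copy()
    out["sentiment_score"] = out[text_col].fillna("").apply(
        lambda text: analyzer.polarity_scores(text)["compound"]
    )
    return out

# Example usage on two hypothetical evaluation records:
evals = pd.DataFrame({
    "student_id": [101, 102],
    "narrative_text": [
        "An outstanding, reliable team member with excellent clinical reasoning.",
        "Competent presentations; would benefit from more thorough pre-rounding.",
    ],
})
print(score_evaluations(evals)[["student_id", "sentiment_score"]])
```

A compound score near +1 indicates strongly positive language; scores produced this way could then be compared across demographic groups and entered as predictors in a grade model such as the multinomial logistic regression described above.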
Discussion: There was no clear evidence of bias in medical student evaluations: sentiment scores and word choices did not differ by gender or race/ethnicity. Of note, sentiment scores from narrative evaluations were not significantly associated with final grades, whereas numerical evaluation scores were. Narrative feedback tended to be positive for all students regardless of final grade, which may limit its usefulness in helping students understand how to improve.

Significance: The lack of disparities in our study contrasts with prior findings from other institutions. Ongoing efforts include comparative analyses with other institutions to understand which institutional factors (e.g., geographic location) may contribute to bias. NLP enables a systematic approach to investigating bias. The lack of association between word choices, sentiment scores, and final grades points to opportunities to improve feedback processes for students.