Objectives: Prior research has demonstrated that men and women emergency medicine (EM) residents receive similar numerical evaluations at the beginning of residency, but that women receive significantly lower scores than men in their final year. To better understand the emergence of this gender gap in evaluations, we examined discrepancies between numerical scores and the sentiment of attached textual comments.
Methods: This multicenter, longitudinal, retrospective cohort study took place at four geographically diverse academic EM training programs across the United States from July 1, 2013, to July 1, 2015, using a real-time, mobile-based, direct-observation evaluation tool. We used complementary quantitative and qualitative methods to analyze 11,845 combined numerical and textual evaluations made by 151 attending physicians (94 men and 57 women) during real-time, direct observations of 202 residents (135 men and 67 women).
Results: Numerical scores were more strongly positively correlated with positive sentiment of the textual comment for men (r = 0.38, P < 0.001) than for women (r = -0.26, P < 0.04), and more strongly negatively correlated with mixed (r = -0.39, P < 0.001) and negative (r = -0.46, P < 0.001) sentiment for men than for women (mixed: r = -0.13, P < 0.28; negative: r = -0.22, P < 0.08). Women were around 11% more likely than men to receive positive comments alongside lower scores, and negative or mixed comments alongside higher scores. Additionally, on average, men received slightly more positive comments and fewer mixed and negative comments in postgraduate year (PGY)-3 than in PGY-1, while women received fewer positive and negative comments in PGY-3 than in PGY-1 and almost the same number of mixed comments.
Conclusion: Women EM residents received more inconsistent evaluations than men EM residents at two levels: 1) inconsistency between numerical scores and sentiment of textual comments; and 2) inconsistency in the expected career trajectory of improvement over time. These findings reveal gender inequality in how attendings evaluate residents and suggest that attendings should be trained to provide all residents with feedback that is clear, consistent, and helpful, regardless of resident gender.