Resuscitation events in pediatric critical and emergency care are high risk, and strong leadership is an important component of an effective response. The Concise Assessment of Leadership Management (CALM) tool, designed to assess leadership skills during pediatric crises, has shown promising validity and reliability evidence in simulated settings. The objective of this study was to generate further validity and reliability evidence for the CALM by applying it to real-life emergency events.

A prospective, video-based study was conducted in an academic pediatric emergency department. Three raters independently used the CALM tool to assess pediatric emergency department physicians as each led both a cardiac arrest and a sepsis resuscitation. Times to critical events (epinephrine, fluid, and antibiotic administration) were collected via video review. Guided by Kane's validity framework, we conducted fully crossed person × event × rater generalizability (G) and decision (D) studies. Interrater reliability was calculated using Gwet's AC2 and intraclass correlation coefficients. Times to critical events were correlated with CALM scores using Spearman's rank correlation coefficient.

Nine team leaders were assessed, each leading 2 resuscitations. The G coefficient was 0.68, with 26% of the variance attributable to subjects (team leaders), 20% to raters, and none to cases (events). Thirty-three percent of the variance was attributed to third-order interactions and unknown factors. Gwet's AC2 was 0.3 and the intraclass correlation coefficient was 0.58. CALM scores correlated with time to epinephrine at −0.79 (P = 0.01) and with time to fluid administration at −0.181 (P = 0.64).

This study provides additional validity evidence for using the CALM tool in real pediatric resuscitations, provided multiple raters are used, consistent with data from the previous simulation-based CALM validity study. Further development of the tool may improve its reliability.
The study also illustrates the rigors of conducting validity work within medical simulation.
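The recommendation to use multiple raters rests on decision (D) study logic: averaging scores over more raters shrinks the rater-linked error terms and raises the projected G coefficient. A minimal Python sketch of that projection follows. It assumes the standard relative G coefficient for a fully crossed person (p) × event (e) × rater (r) random-effects design from generalizability theory. The class and function names are hypothetical, and the example variance components are invented for illustration; they are not this study's estimates, which are not all reported here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """Hypothetical container for estimated variance components of a
    fully crossed person (p) x event (e) x rater (r) random-effects design."""
    person: float              # sigma^2(p): true differences among team leaders
    person_event: float        # sigma^2(pe): leader-by-event interaction
    person_rater: float        # sigma^2(pr): leader-by-rater interaction
    person_event_rater: float  # sigma^2(per,e): three-way interaction confounded with residual error


def relative_g_coefficient(vc: VarianceComponents, n_events: int, n_raters: int) -> float:
    """Projected relative G coefficient (E rho^2) when each leader is scored
    on n_events events by n_raters raters.

    Relative error includes only interactions with persons; main effects of
    event and rater do not change rank ordering, so they are excluded here.
    """
    if n_events < 1 or n_raters < 1:
        raise ValueError("n_events and n_raters must be positive integers")
    relative_error = (
        vc.person_event / n_events
        + vc.person_rater / n_raters
        + vc.person_event_rater / (n_events * n_raters)
    )
    return vc.person / (vc.person + relative_error)


def d_study(vc: VarianceComponents, n_events: int, max_raters: int) -> dict[int, float]:
    """Map each rater count 1..max_raters to its projected G coefficient."""
    return {
        n_r: relative_g_coefficient(vc, n_events, n_r)
        for n_r in range(1, max_raters + 1)
    }


if __name__ == "__main__":
    # Made-up components for illustration only (NOT the study's estimates).
    example = VarianceComponents(
        person=1.0, person_event=0.0, person_rater=1.0, person_event_rater=1.0
    )
    for n_r, g in d_study(example, n_events=1, max_raters=4).items():
        print(f"{n_r} rater(s): G = {g:.2f}")
```

With these made-up inputs the projected coefficient rises from about 0.33 with one rater to about 0.67 with four. An absolute (dependability, Φ) coefficient would also place the event, rater, and event × rater terms in the error, so it would generally be lower whenever those components are nonzero.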