Introduction
Neonatal resuscitation is a high-acuity, low-occurrence event that requires ongoing practice by interprofessional teams to maintain proficiency. Simulation provides an ideal platform for team training and evaluation of team performance. Our simulation center supports a longitudinal in situ simulation training program for delivery room teams. In addition to adherence to Neonatal Resuscitation Program (NRP) standards, team performance assessment is an essential component of program evaluation and participant feedback. Multiple published teamwork assessment tools exist. Our objective was to select the tool with the best validity evidence for our program's needs.

Methods
We used Messick's framework to evaluate the validity evidence for candidate teamwork assessment tools. Four tools were identified from the literature: the Mayo High Performance Teamwork Scale (Mayo), Team Performance Observation Tool (TPOT), Clinical Teamwork Scale (CTS), and Team Emergency Assessment Measure (TEAM). Relevant contextual considerations included team versus individual focus, external evaluation versus self-evaluation, and ease of use (encompassing efficiency, clarity of interpretation, and overall assessment). Three simulation experts identified consensus anchors for each tool, then independently reviewed and scored 10 pre-recorded neonatal resuscitation simulations. Raters also rated each tool on efficiency, ease of interpretation, and completeness of teamwork assessment. Interrater reliability (IRR) was calculated for each tool across the three raters using the intraclass correlation coefficient (ICC). Average team performance scores for each tool were correlated with each video's neonatal resuscitation adherence score using Spearman's rank correlation coefficient.

Results
IRR varied across the tools; Mayo had the best (single-measure ICC 0.55, multiple-measure ICC 0.78). All three raters rated Mayo highest for efficiency (mean 4.66 ± 0.577) and ease of use (4 ± 1). However, TPOT and CTS scored highest (mean 4.66 ± 0.577) for overall completeness of teamwork assessment. No teamwork tool's scores correlated significantly with NRP adherence scores.

Conclusion
Of the four tools assessed, Mayo demonstrated moderate IRR and scored highest for ease of use and efficiency, though not for completeness of assessment. The remaining three tools had poor IRR, a common challenge with teamwork assessment tools. Our process underscores that assessment tool validity is contextual. Factors such as a relatively narrow (and high) performance distribution and the clinical context may have contributed to reliability challenges for the tools that offered a more complete teamwork assessment.
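To illustrate the reliability analysis described in Methods, the two ICC values reported for each tool can be computed from a subjects-by-raters score matrix via a two-way ANOVA decomposition. This is a minimal pure-Python sketch on made-up ratings, not the software or data used in the study; it assumes the reported "single" and "multi" values correspond to the two-way random-effects, absolute-agreement estimates ICC(2,1) and ICC(2,k).

```python
# Sketch of single-measure ICC(2,1) and multiple-measure ICC(2,k)
# from a two-way random-effects ANOVA decomposition.
def icc_two_way_random(ratings):
    """ratings: list of rows, one row per subject (video), one column per rater."""
    n = len(ratings)         # number of subjects (e.g., 10 recorded simulations)
    k = len(ratings[0])      # number of raters (e.g., 3 simulation experts)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ssr = k * sum((m - grand) ** 2 for m in row_means)   # between-subjects
    ssc = n * sum((m - grand) ** 2 for m in col_means)   # between-raters
    sse = ss_total - ssr - ssc                           # residual

    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average

# Hypothetical scores: 4 videos rated by 3 raters (illustrative only).
ratings = [[9, 9, 8], [7, 8, 7], [5, 5, 6], [3, 2, 3]]
single, average = icc_two_way_random(ratings)
```

The same per-video average scores could then be correlated with NRP adherence scores using `scipy.stats.spearmanr`, as in the Methods.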