Abstract

Study question
Which criteria described in the WHO 2010 manual cause the most problems during sperm morphology assessment and lead to outcome variation in a national external quality control program?

Summary answer
Assessment of head ovality, regularity of the head contour, regularity of the midpiece contour and alignment of the major axes of midpiece and head leads to the most variation.

What is known already
Morphology assessment of spermatozoa is known to be a rather difficult part of semen analysis. Over the past 40 years, morphological criteria became stricter and reference values changed significantly, leading to a lower reference value of 4% for normal/typical morphology in 2010. This has consequences for the statistical power of the analysis. Moreover, many laboratories do not use the staining method advised by the WHO and are becoming increasingly strict in applying the criteria. Improvement of the assay is therefore necessary. In this study, as a first step, variation in the use of the strict criteria is evaluated.

Study design, size, duration
Data from the Dutch external quality control (EQC) program over the period 2015-2020 were evaluated in this retrospective study. The program consists of four rounds per year, each including the assessment of three photos of Papanicolaou-stained spermatozoa. These spermatozoa were judged dichotomously (normal/abnormal) on 14 morphological criteria (WHO manual, 2010). Consensus results of three experts served as the reference. In total, variation in the results for 72 photos (72 photos × 14 criteria = 1008 values) was analysed.

Participants/materials, setting, methods
Participants were staff members from Dutch laboratories that perform semen analysis (1 member per lab per round). The outcomes of the participants were tested for variation per criterion, both over the entire 6-year period and for trends during this period.
To gain insight into the influence of “time”, three photos were provided three times (in 2015/2018/2020) and six photos were provided twice (in 2016/2018 and 2018/2020). The setup was blinded to both participants and experts.

Main results and the role of chance
In the period 2015-2020, 88-103 laboratories participated in the EQC program. Of these laboratories, 40-60 took part in the photo evaluation. Variation per criterion was expressed in the categories green, orange and red, corresponding to >90%, 60-90% and <60% agreement between the participants, respectively. Overall, 57% of results were in category green, 37% in orange and 6% in red. Head ovality, regularity of head contours, regularity of midpiece contours and alignment of the major axes of midpiece and head led to the most variation. For these criteria, 14, 17, 10 and 17% of results, respectively, were in category red and 50, 47, 71 and 64%, respectively, in category orange. The lowest variation was found for acrosomal vacuoles, excessive residual cytoplasm, tail thickness and tail length, with 76, 77, 94 and 85%, respectively, in category green. Trend analysis led to similar conclusions: most criteria show a slightly positive trend, but head ovality and regularity of head and midpiece show a stable or declining trend. Three photos were used in three rounds and six photos in two rounds. In 26 (8.8%) cases, shifts towards higher (5) or lower (21) variation were found. Experts changed their opinion in 3 (1%) cases.

Limitations, reasons for caution
Results depend on the morphology of the spermatozoa (magnitude of the abnormalities) and on the photo quality.

Wider implications of the findings
The definitions of the criteria need to be better explained and trained, especially for ovality of the head, regularity of the midpiece and overlap of the longitudinal axes of midpiece and head. Moreover, explaining the criteria in the light of physiology will probably lead to better evaluations by participants.

Trial registration number
Not applicable
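
The green/orange/red classification described under the main results can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The abstract does not define how "agreement between the participants" was computed, so the sketch assumes it is the percentage of participants giving the most common judgement for one criterion on one photo. The names `agreement_pct` and `category` are hypothetical. Following the thresholds as written (>90%, 60-90%, <60%), exactly 90% and exactly 60% are both treated as orange.

```python
from collections import Counter

# Figures taken from the abstract: 6 years x 4 rounds x 3 photos, 14 criteria.
N_PHOTOS = 6 * 4 * 3
N_CRITERIA = 14
N_VALUES = N_PHOTOS * N_CRITERIA  # 72 photos x 14 criteria


def agreement_pct(judgements):
    """Percentage of participants giving the most common judgement.

    ASSUMPTION: this definition of agreement is illustrative only;
    the abstract does not state how agreement was calculated.
    """
    counts = Counter(judgements)
    return 100.0 * max(counts.values()) / len(judgements)


def category(pct):
    """Map an agreement percentage to the colour categories of the abstract."""
    if pct > 90:
        return "green"   # >90% agreement
    if pct >= 60:
        return "orange"  # 60-90% agreement
    return "red"         # <60% agreement


# Hypothetical example: 50 participants judging one criterion on one photo.
votes = ["normal"] * 46 + ["abnormal"] * 4
print(N_VALUES, agreement_pct(votes), category(agreement_pct(votes)))
```

Under this assumed definition agreement can never fall below 50% for a dichotomous judgement, so the red category would cover splits between 50% and 60%.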