Irwin tests are key preclinical study elements for characterising drug-induced neurological side effects. This multicentre study aimed to assess the robustness of Irwin tests across multinational sites during three stages of protocol harmonisation. The projects were part of the Enhanced Quality in Preclinical Data framework, which aims to increase success rates in the transition from preclinical testing to clinical application. Female and male NMRI mice were assigned to one of three groups (vehicle, MK-801 0.1 mg kg⁻¹ or MK-801 0.3 mg kg⁻¹). Irwin scores were assessed at baseline and at multiple time points following intraperitoneal injection of MK-801 using local protocols (Stage 1), shared protocols with harmonised environmental design (Stage 2) and fully harmonised Irwin scoring protocols (Stage 3). The analysis, based on four functional domains (motor, autonomic, sedation and excitation), revealed substantial data variability in Stages 1 and 2. Although marked overall heterogeneity between sites remained in Stage 3 after complete harmonisation of the Irwin scoring scheme, heterogeneity within functional domains was only moderate. When comparing treatment groups with vehicle, we found large effect sizes in the motor domain and subtle to moderate effects in the excitation-related and autonomic domains. The pronounced interlaboratory variability in Irwin datasets for the CNS-active compound MK-801 needs to be carefully considered when making decisions during drug development. While environmental and general study design had only a minor impact, the study suggests that harmonisation of parameters and their scoring can limit variability and increase robustness.