Abstract

Evaluation metrics play a pivotal role in the calibration of hydrological models: they serve as objective functions that directly determine the final parameter values and strongly influence how users perceive model performance. However, the choice and interpretation of evaluation metrics are often subjective; this study therefore provides a more objective framework for assessing model performance. We first examined the applicability of several commonly used evaluation metrics and summarized their limitations. We then decomposed model errors according to their physical meaning and their geometric representation in scatter plots, classifying them into systematic and unsystematic components. By decomposing and deriving the Nash–Sutcliffe efficiency (NSE) formula, we established quantitative relationships among the various evaluation metrics. The Soil and Water Assessment Tool (SWAT) was used to simulate monthly runoff in the Baishan basin (China) for the period 1994–2017, with NSE serving as the objective function for calibration. Consistent with previous studies, our results indicate that the model tends to slightly underestimate high flows while substantially overestimating low flows. Further analysis based on the error decomposition and the relationships among the evaluation metrics revealed that unsystematic errors dominated during the spring snowmelt runoff period, whereas systematic errors prevailed in the dry season. Evaluating the runoff series by flow magnitude, or by season and month, provided a more stringent assessment of model performance. These findings highlight the need for careful selection of evaluation metrics and underscore the significance of our methodological advances in enhancing the precision and reliability of hydrological models.
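The abstract does not reproduce the underlying formulas; the sketch below shows one common form of the decomposition referred to above, assuming the partition of the mean squared error about the ordinary-least-squares line in the observed–simulated scatter plot (often attributed to Willmott). The symbols $O_i$, $S_i$, $\hat{S}_i$, $a$ and $b$ are introduced here for illustration and are not taken from the paper itself.

\[
\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(O_i - S_i\right)^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2}
= 1 - \frac{\mathrm{MSE}}{\sigma_O^{2}},
\qquad
\mathrm{MSE} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{S}_i - O_i\right)^2}_{\text{systematic}}
+ \underbrace{\frac{1}{n}\sum_{i=1}^{n}\left(S_i - \hat{S}_i\right)^2}_{\text{unsystematic}},
\]

where $O_i$ and $S_i$ are the observed and simulated runoff values, $\bar{O}$ and $\sigma_O^{2}$ are the observed mean and variance, and $\hat{S}_i = a + b\,O_i$ is the least-squares fit of the simulations on the observations in the scatter plot. Because the cross term vanishes for the least-squares line, the two components sum exactly to the MSE, which links the systematic/unsystematic split directly to NSE.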