Stochastic rainfall models are important tools for evaluating hydrological risks such as flooding and drought because they can randomly generate alternative, plausible climatic timeseries. The stochastic generation of climatic timeseries is not an end in itself: the generated timeseries are typically applied to a catchment to determine the performance of water-related infrastructure systems, such as reservoirs or flood-control measures. This methodology typically involves a train of models to determine the end-of-system impact, yet the evaluation of stochastically generated rainfall timeseries is usually a stand-alone procedure focused on metrics directly related to the stochastic generator. This paper demonstrates discrepancies in this approach by evaluating two daily-timestep stochastic rainfall models in terms of both their rainfall metrics and the flow metrics obtained after rainfall-runoff transformation. The two models are a Markov-based model and a latent-variable model; each is calibrated and evaluated, showing ‘overall good’ performance. Stochastically generated timeseries, alongside observed rainfall timeseries, are input to a calibrated catchment model (GR4J) to derive daily flow timeseries. Whereas the rainfall metrics typically showed ‘good’ performance, the streamflow-based metrics are not necessarily ‘good’. The procedure is repeated for 277 stations from Australia and 106 stations from the United States of America. Depending on the strictness of the flow-based comparison and the region analysed, 12–26% of sites were classified as ‘poor’ performing using the Markov-based model, and 1–9% using the latent-variable model. The results demonstrate that catchment-based evaluation using flow metrics is more holistic, since it magnifies features of the rainfall not otherwise visible to rainfall-based evaluation.
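The evaluation chain described above (generate rainfall, transform it to flow, then compare both rainfall and flow metrics against the observed record) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the generator is a generic two-state Markov chain with exponentially distributed wet-day depths, not necessarily the Markov-based model used in the paper; `bucket_runoff` is a toy single-store model standing in for GR4J; and all function names, parameter values, and the choice of the mean as the only metric are hypothetical.

```python
import random
import statistics


def markov_rainfall(n_days, p_wet_given_dry, p_wet_given_wet, mean_depth, rng):
    """Generic two-state (wet/dry) first-order Markov chain.

    Wet-day depths (mm) are drawn from an exponential distribution.
    This is an illustrative generator, not the paper's calibrated model.
    """
    rain, wet = [], False
    for _ in range(n_days):
        p_wet = p_wet_given_wet if wet else p_wet_given_dry
        wet = rng.random() < p_wet
        rain.append(rng.expovariate(1.0 / mean_depth) if wet else 0.0)
    return rain


def bucket_runoff(rain, capacity=50.0, loss=2.0, k=0.1):
    """Toy single-store rainfall-runoff model (a stand-in for GR4J).

    Each day: add rain, subtract a fixed loss, spill anything above
    `capacity`, then drain a fraction `k` of the store as baseflow.
    """
    store, flow = 0.0, []
    for p in rain:
        store = max(store + p - loss, 0.0)
        spill = max(store - capacity, 0.0)
        store -= spill
        base = k * store
        store -= base
        flow.append(spill + base)
    return flow


def relative_error(simulated, reference):
    """Signed relative error; assumes `reference` is non-zero."""
    return (simulated - reference) / reference


def evaluate(observed_rain, replicates):
    """Compare mean rainfall and mean flow of replicates against observations."""
    observed_flow = bucket_runoff(observed_rain)
    rain_mean = statistics.fmean(statistics.fmean(r) for r in replicates)
    flow_mean = statistics.fmean(
        statistics.fmean(bucket_runoff(r)) for r in replicates
    )
    return {
        "rain_mean_error": relative_error(rain_mean, statistics.fmean(observed_rain)),
        "flow_mean_error": relative_error(flow_mean, statistics.fmean(observed_flow)),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic placeholder for an observed record (not real station data).
    observed = markov_rainfall(3650, 0.2, 0.7, 8.0, rng)
    # Replicates from a deliberately mis-specified generator (less persistence).
    replicates = [markov_rainfall(3650, 0.3, 0.5, 8.5, rng) for _ in range(20)]
    print(evaluate(observed, replicates))
```

Running the script prints the relative error in mean rainfall next to the relative error in mean flow for the same replicates, which is the kind of side-by-side comparison the abstract argues for. A fuller evaluation would replace the single mean statistic with a suite of rainfall and flow metrics and apply a classification threshold per site.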