The complexity of karst groundwater flow modelling is reflected in the number of available simulation approaches. The goal of the Karst Modelling Challenge (KMC) is to compare different approaches on a single system using the same data set. Thirteen teams, each using a different computational model for simulating discharge variations at karst springs, applied their respective models to a single data set from the Milandre Karst Hydrogeological System (MKHS). The approaches include neural networks, reservoir models, semi-distributed models and fully distributed groundwater models. Four and a half years of hourly or daily meteorological input and hourly discharge data were provided for model calibration. The validation consisted of forecasting one year of discharge without access to the observed discharge data. Model performance was evaluated using volume conservation, the Nash-Sutcliffe efficiency (NSE) and the Kling-Gupta efficiency (KGE), applied to the total discharge and to individual flow components. Comparing model performances proved to be a challenging task because of differences in model architecture but also in the required time steps: some models require aggregated daily input while others can be run with hourly data, and the way the data were transformed introduced some interesting differences. The use of instantaneous data (e.g. the value at noon) produces less bias than averaging hourly data over one day. The transformation of hourly into daily data produces a decrease in NSE and KGE of 0.05 to 0.08 (i.e. from 1 to ~0.93). The resulting simulations (forecasted values for the year 2016) produced KGEs ranging between 0.83 and 0.37 (0.83 to −0.24 for NSE). Although the simulations matched the monitored flows reasonably well, most models struggled to simulate baseflow conditions accurately. In general, the models that performed best in this exercise were the global ones (Gardenia and Varkarst), which have a limited number of parameters and can be calibrated using automatic calibration procedures. The neural network models also showed fair potential, with one providing reasonable results despite the relatively short data set available for warm-up (4.5 years). Semi- and fully distributed models also suggested that, with some additional effort, they could perform well. The accuracy of model predictions does not seem to increase when models have more than 9–12 calibration parameters. An evaluation of the relative errors between the forecasted and observed values revealed that, for most models, 50% of the forecasted values deviated from the observed discharge by more than 50%, and 25% by more than 100%. A significant part of the poorly forecasted values corresponded to base-flow, which was surprising given that base-flow is generally much easier to predict than peak flow. This shows that the modelling approaches and calibration criteria are too strongly oriented towards the peak-flow sections of the hydrographs, and that improvements could be gained by placing more focus on base-flow.
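For readers unfamiliar with the evaluation criteria, the sketch below illustrates the two efficiency measures (NSE and KGE, in their standard formulations) and the two hourly-to-daily conversions discussed above (instantaneous value at noon versus daily mean). It is a minimal illustrative example, not part of the KMC evaluation tooling; the function names and the assumption of complete 24-hour days are hypothetical.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means the
    simulation is no better than the mean of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta efficiency, combining correlation (r), variability
    ratio (alpha) and bias ratio (beta); 1 is a perfect fit."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Two ways of turning an hourly discharge series (length a multiple of 24,
# starting at 00:00) into daily values:
def daily_mean(hourly):
    """Average the 24 hourly values of each day."""
    return np.asarray(hourly, float).reshape(-1, 24).mean(axis=1)

def daily_at_noon(hourly):
    """Take the instantaneous value at 12:00 of each day."""
    return np.asarray(hourly, float).reshape(-1, 24)[:, 12]
```

Comparing `nse`/`kge` scores computed on `daily_mean(...)` versus `daily_at_noon(...)` series gives a feel for the 0.05 to 0.08 degradation reported above when hourly data are aggregated to daily time steps.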