Abstract. Satellite observations can provide valuable information for a better understanding of hydrological processes and can thus serve as useful tools for developing and improving model structures. While model calibration and evaluation have in recent years made increasing use of spatial, mostly remotely sensed information, model structure development still relies largely on discharge observations at basin outlets only. Owing to the ill-posed inverse nature of the modelling problem and the associated equifinality, this frequently results in poor representations of the spatio-temporal heterogeneity of system-internal processes, in particular for large river basins. The objective of this study is thus to explore the value of remotely sensed, gridded data for improving our understanding of the processes underlying this heterogeneity and, as a consequence, their quantitative representation in models through a stepwise adaptation of model structures and parameters. For this purpose, a distributed, process-based hydrological model was developed for the study region, the poorly gauged Luangwa River basin. As a first step, this benchmark model was calibrated to discharge data only and, in a post-calibration evaluation procedure, tested for its ability to simultaneously reproduce (1) the basin-average temporal dynamics of remotely sensed evaporation and total water storage anomalies and (2) their temporally averaged spatial patterns. This allowed model structural deficiencies in reproducing these temporal dynamics and spatial patterns to be diagnosed. Subsequently, the model structure was adapted in a stepwise procedure, testing five additional, alternative process hypotheses that could potentially better describe the observed dynamics and patterns. 
These included, on the one hand, the addition and testing of alternative formulations of groundwater upwelling into wetlands as a function of the water storage and, on the other hand, alternative spatial discretizations of the groundwater reservoir. As with the benchmark, each alternative model hypothesis was then calibrated to discharge only and tested for its ability to reproduce the observed spatio-temporal patterns in evaporation and water storage anomalies. In a final step, all models were re-calibrated to discharge, evaporation and water storage anomalies simultaneously. The results indicated that (1) the benchmark model (Model A) could reproduce the time series of observed discharge, basin-average evaporation and total water storage reasonably well. In contrast, it poorly represented the time series of evaporation in wetland-dominated areas as well as the spatial patterns of evaporation and total water storage. (2) Stepwise adjustment of the model structure (Models B–F) suggested that the combination of Model F, which allows for groundwater upwelling from a distributed representation of the groundwater reservoir, and (3) simultaneous calibration to multiple variables, i.e. discharge, evaporation and total water storage anomalies, provided the best representation of all these variables with respect to both their temporal dynamics and spatial patterns, except for the basin-average temporal dynamics of the total water storage anomalies. Overall, the results show that satellite-based evaporation and total water storage anomaly data are not only valuable for multi-criteria calibration but can also play an important role in improving our understanding of hydrological processes, by diagnosing model deficiencies and guiding stepwise improvement of the model structure.
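The abstract does not state which performance metrics or which aggregation scheme were used when calibrating to discharge, evaporation and total water storage anomalies simultaneously. As a minimal sketch only, assuming the Kling-Gupta efficiency (KGE) as the per-variable metric and a hypothetical equal-weight aggregation (the function names `kge` and `combined_objective` are illustrative, not taken from the study), a multi-variable calibration objective could look like this:

```python
import numpy as np


def kge(sim, obs, use_bias=True):
    """Kling-Gupta efficiency; 1.0 indicates a perfect fit.

    Assumption: the bias term (ratio of means) is optional, because it is
    ill-defined for anomaly series such as total water storage anomalies,
    whose mean is close to zero.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]       # agreement in timing and shape
    alpha = sim.std() / obs.std()         # ratio of variabilities
    terms = [(r - 1.0) ** 2, (alpha - 1.0) ** 2]
    if use_bias:
        beta = sim.mean() / obs.mean()    # ratio of means (volume bias)
        terms.append((beta - 1.0) ** 2)
    return 1.0 - np.sqrt(sum(terms))


def combined_objective(scores):
    """Hypothetical equal-weight aggregation of per-variable KGE scores.

    Computes 1 minus the root-mean-square distance of the scores from
    their ideal value of 1, so it equals 1.0 only if every variable is
    reproduced perfectly.
    """
    scores = np.asarray(scores, dtype=float)
    return 1.0 - np.sqrt(np.mean((1.0 - scores) ** 2))


# Illustrative use with synthetic series (not data from the study).
rng = np.random.default_rng(42)
t = np.linspace(0.0, 2.0 * np.pi, 365)
obs_q = rng.gamma(2.0, 50.0, t.size)          # discharge
obs_e = 3.0 + np.sin(t)                       # basin-average evaporation
obs_s = 100.0 * np.sin(t)                     # water storage anomaly
sim_q = 0.9 * obs_q + rng.normal(0.0, 5.0, t.size)
sim_e = 3.2 + 0.8 * np.sin(t)
sim_s = 80.0 * np.sin(t - 0.3)

score = combined_objective([
    kge(sim_q, obs_q),
    kge(sim_e, obs_e),
    kge(sim_s, obs_s, use_bias=False),        # anomalies: no bias term
])
print(f"combined objective: {score:.3f}")
```

The equal weighting is a design choice made here for simplicity; other weightings or a Pareto-based approach would be equally plausible. The sketch also covers temporal dynamics only: evaluating the temporally averaged spatial patterns described above would require a separate spatial metric, which is not shown.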