One of the main objectives of numerical weather prediction models is reliable forecasting of heavy rain events. This paper discusses problems and strategies in evaluating daily rain forecasts against operationally available rain station data. The focus is on spatial upscaling of rain station data to the grid of the direct model output. We show the limitations of regression-based or smoothing-based upscaling, as done, for example, in Kriging analysis, and promote probabilistic upscaling by ensembles of stochastic simulations conditioned on the available observations. These ensembles readily provide uncertainty estimates for the daily evaluation and unbiased estimates of second-moment comparison statistics. As an evaluation exercise we assess the quality of daily forecasts for Austria (total area: 84,000 km²) with the limited-area model ALADIN (horizontal grid spacing 10 km). A quasi-operational set-up is compared to a physically enhanced but less well tested and tuned set-up. It is shown that the evaluation uncertainty is large, but with a full year of forecasts available it is possible to conclude that the physically enhanced set-up simulates too much rain, and significantly more than the operational version, with only small differences in simulated patterns and variability.
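The contrast between smoothing-based and probabilistic upscaling can be illustrated with a toy example. The sketch below is not the set-up used in this paper. It assumes a one-dimensional, zero-mean Gaussian field with a known exponential covariance, whereas daily rain is neither Gaussian nor one-dimensional. The grid size, number of stations, correlation length and all variable names are hypothetical. It compares the spatial variance of a simple-kriging estimate with that of an ensemble of conditional simulations, generated here by the standard "conditioning by kriging" construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: 1-D grid, exponential covariance, unit variance.
n_grid, n_stations, n_members = 200, 15, 200
x = np.arange(n_grid, dtype=float)
corr_length = 20.0
cov = np.exp(-np.abs(x[:, None] - x[None, :]) / corr_length)
chol = np.linalg.cholesky(cov)

# Synthetic "true" field, observed only at randomly chosen grid points.
truth = chol @ rng.standard_normal(n_grid)
stations = np.sort(rng.choice(n_grid, size=n_stations, replace=False))
obs = truth[stations]

# Simple-kriging weights (zero mean assumed): C_gs @ inv(C_ss).
c_ss = cov[np.ix_(stations, stations)]
weights = np.linalg.solve(c_ss, cov[stations, :]).T
kriged = weights @ obs  # smooth estimate on the grid

# Conditional simulation: unconditional draw plus the kriged
# correction of its misfit at the station locations.
uncond = chol @ rng.standard_normal((n_grid, n_members))
ensemble = uncond + weights @ (obs[:, None] - uncond[stations, :])

kriged_var = kriged.var()
ensemble_var = ensemble.var(axis=0)  # spatial variance of each member

print(f"spatial variance, truth:         {truth.var():.3f}")
print(f"spatial variance, kriged field:  {kriged_var:.3f}")
print(f"spatial variance, ensemble mean: {ensemble_var.mean():.3f}")
print(f"spatial variance, ensemble std:  {ensemble_var.std():.3f}")
```

Every ensemble member reproduces the station values exactly, as does the kriged field. Away from the stations the kriged field relaxes towards the mean and therefore tends to underestimate the spatial variance, while the ensemble members retain the variability of the assumed covariance model. The spread across members gives a rough measure of how uncertain such a variance estimate is for a single day.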