Analysis of near-surface air temperature is vital for many applications, such as urban heat island and climate change studies. In particular, extreme weather events are typically localized, and the corresponding adaptation measures should be localized as well. However, climate scientists are often confronted with the difficulty of providing reliable predictions at high spatial resolutions on the order of 1 km. We propose to train a convolutional neural network to emulate the hourly high-resolution near-surface air temperature field simulated by the Weather Research and Forecasting (WRF) software over a period of 18 months. The model is driven by current and lagged coarse SEVIRI land surface temperature (LST) fields. This nowcasting application downscales the spatial resolution of the input by about a factor of four while relating current and past LST maps to the current near-surface air temperature field. The proposed U-Net variant uses regularization to prevent over-fitting and implements a novel space-time approach in which multiple time steps are fed into the model through 3D convolution layers. Besides LST, the model uses urban density as an additional static input, both to predict the temperature more accurately in urban areas and to improve the generalizability of the trained model. The performance of our U-Net model is assessed by comparison with a multiple linear regression (MLR) benchmark with ridge regularization. The U-Net outperforms the benchmark on all performance metrics considered. It achieves a mean absolute error of 1.36 °C versus 1.49 °C for the benchmark (a 9% relative improvement) and a root mean square error of 1.77 °C versus 1.91 °C for the benchmark (a 7% relative improvement). The absolute error of the model is less than 2 °C for 77% of the prediction points versus 72% for the benchmark (a 7% relative improvement). The improvement over the benchmark is even more pronounced during extremely hot periods.
We demonstrate the generalizability of the approach by testing the trained model on unseen spatial domains.
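The three reported metrics and the quoted relative improvements can be illustrated with a minimal sketch. The function names `error_metrics` and `relative_improvement` are hypothetical and do not come from the study's code. The sketch assumes that the relative improvement is taken with respect to the benchmark value, which is consistent with the rounded percentages quoted above.

```python
import math


def error_metrics(predicted, reference, threshold=2.0):
    """Return (MAE, RMSE, fraction of points with |error| < threshold).

    Inputs are flat sequences of temperatures in degrees Celsius.
    The 2 degree threshold mirrors the one quoted in the abstract.
    """
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(p - r) for p, r in zip(predicted, reference)]
    n = len(abs_errors)
    mae = sum(abs_errors) / n
    rmse = math.sqrt(sum(e * e for e in abs_errors) / n)
    within = sum(1 for e in abs_errors if e < threshold) / n
    return mae, rmse, within


def relative_improvement(model_value, benchmark_value, lower_is_better=True):
    """Improvement of the model over the benchmark, relative to the benchmark.

    Assumption: the benchmark value is the denominator.
    """
    if lower_is_better:
        return (benchmark_value - model_value) / benchmark_value
    return (model_value - benchmark_value) / benchmark_value


# Values reported in the abstract.
mae_gain = relative_improvement(1.36, 1.49)   # about 0.087, i.e. 9% rounded
rmse_gain = relative_improvement(1.77, 1.91)  # about 0.073, i.e. 7% rounded
hit_gain = relative_improvement(0.77, 0.72, lower_is_better=False)  # about 0.069
```

For error metrics a lower value is better, whereas for the share of points within 2 °C a higher value is better, hence the `lower_is_better` switch.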