Urban air pollution has become a severe environmental and societal issue worldwide, and there is a growing need for air quality forecasts that help prevent health and economic losses. Most existing studies provide forecasts only at one or several monitoring sites rather than citywide. Because of complex urban structure and the spatial transport of air pollutants, citywide air quality can vary strongly over space, which makes accurate forecasting extremely challenging. Here, we address this challenge by combining dense air quality monitoring data with deep learning. Using fine-grained PM2.5 concentration data collected from 417 micro monitoring stations in Lanzhou City, China, we propose an end-to-end method to predict future citywide PM2.5 concentrations at a resolution of 500 m × 500 m × 1 h. The method consists of a spatio-temporal transformation component and a spatio-temporal prediction component. The former converts PM2.5 concentration data into spatial maps; the latter, a ConvLSTM-based spatio-temporal prediction network (Air-PredNet), captures the spatio-temporal characteristics of the spatial-map sequences and generates citywide air quality forecasts. The results show that our method accurately predicts the spatial distribution and variation of PM2.5 concentration, with an RMSE of 2.12 μg/m³ for 1-h prediction and 4.09 μg/m³ for 6-h prediction. We also demonstrate the robustness of Air-PredNet through its prediction performance for O3 and NO2. Our results demonstrate the potential and feasibility of combining dense monitoring data with advanced data science methods for citywide urban air quality forecasting, which supports fine-scale management of the urban atmospheric environment and related decision-making.
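The transformation step and the RMSE metric described above can be illustrated with a minimal sketch. The abstract does not specify how station readings are converted into spatial maps, so this example assumes the simplest option: averaging the readings that fall in each 500 m cell and leaving empty cells as NaN. The function names `stations_to_map` and `rmse`, the projected metre coordinates, and the toy values are all hypothetical and are not taken from the study.

```python
import numpy as np

CELL_M = 500.0  # grid cell size in metres, matching the 500 m resolution in the text


def stations_to_map(x_m, y_m, values, width_m, height_m, cell_m=CELL_M):
    """Average station readings into a regular grid; cells with no station are NaN.

    Assumption: plain cell averaging, not necessarily the paper's transformation.
    """
    n_rows = int(np.ceil(height_m / cell_m))
    n_cols = int(np.ceil(width_m / cell_m))
    # Map each station's projected coordinates (metres) to a grid cell index.
    rows = np.clip((np.asarray(y_m) // cell_m).astype(int), 0, n_rows - 1)
    cols = np.clip((np.asarray(x_m) // cell_m).astype(int), 0, n_cols - 1)
    total = np.zeros((n_rows, n_cols))
    count = np.zeros((n_rows, n_cols))
    # np.add.at accumulates correctly when several stations share a cell.
    np.add.at(total, (rows, cols), values)
    np.add.at(count, (rows, cols), 1)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(count > 0, total / count, np.nan)


def rmse(pred, obs):
    """Root-mean-square error over the cells that have an observation."""
    mask = ~np.isnan(obs)
    return float(np.sqrt(np.mean((pred[mask] - obs[mask]) ** 2)))


# Toy example: three stations on a 1 km x 1 km domain (a 2 x 2 grid).
grid = stations_to_map([100, 300, 900], [100, 200, 700], [10.0, 20.0, 30.0],
                       width_m=1000, height_m=1000)
```

A sequence of such maps, one per hour, would form the kind of input that a ConvLSTM-style network consumes. In practice the empty cells would need to be filled (for example by interpolation) before training, a choice this sketch leaves open.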