ABSTRACT Urban floods pose a significant threat to human communities, making their prediction essential for comprehensive flood risk assessment and for formulating effective resource allocation strategies. Data-driven deep learning approaches have gained traction in urban emergency flood prediction because they address the efficiency constraints of physical models. However, the spatial structure of rainfall, which has a profound influence on urban flooding, is overlooked in many deep learning studies. In this study, we introduce CRU-Net, a novel deep learning model equipped with an attention mechanism, to predict inundation depths in urban terrains from spatiotemporal rainfall patterns. The model takes as inputs eight topographic parameters related to the depth of urban waterlogging, combined with spatial rainfall data. Comparative evaluations of CRU-Net against two other deep learning models, U-Net and ResU-Net, show that CRU-Net captures the spatiotemporal characteristics of rainfall and accurately estimates flood depths, particularly in deeply inundated and flood-vulnerable regions. The model achieves a root mean square error of 0.054 m and a Nash–Sutcliffe efficiency of 0.975. CRU-Net also correctly predicts over 80% of inundation locations with depths exceeding 0.3 m. In terms of efficiency, CRU-Net delivers predictions for 3 million grid cells in 2.9 s.
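For readers unfamiliar with the two accuracy metrics reported above, the following is a minimal sketch of their standard definitions, not the evaluation code used in this study. The function names `rmse` and `nse` and the sample depth values are hypothetical and chosen only for illustration; "observed" here stands for whatever reference depths the predictions are compared against.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the same unit as the inputs (e.g. metres)."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    skill of always predicting the mean of the observations."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_observed) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined when all observations are equal")
    return 1.0 - residual / variance


# Hypothetical flood depths in metres, for illustration only.
reference_depths = [0.00, 0.10, 0.35, 0.60]
predicted_depths = [0.02, 0.12, 0.30, 0.55]

print("RMSE (m):", rmse(reference_depths, predicted_depths))
print("NSE:", nse(reference_depths, predicted_depths))
```

A lower RMSE and an NSE closer to 1 both indicate closer agreement between predicted and reference depths.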