As climate change intensifies, extreme rainfall events that lead to flooding are occurring more frequently. It is therefore crucial to develop and deploy flood forecasting models that provide early knowledge of the severity of flood risks, allowing decision makers to take preventive and response actions that reduce human casualties. This study, carried out in collaboration with the National Science and Technology Center for Disaster Reduction (NCDR) of Taiwan, combines deep learning techniques with attention mechanisms to establish a data-driven flood forecasting framework that can serve as an early warning system for flooding hazards. The framework uses historical flood depth data from an array of flood sensors, together with additional features engineered from KRID (estimation Rainfall with Integrated radar-gauge Data) observed rainfall data, and consists of two phases. The first phase is a flood event detection model built using a classification-based random forest algorithm. The second phase is a flood depth prediction model that combines a spatial–temporal attention mechanism with a transformation-gated long short-term memory (LSTM) network. This novel model, the Spatial–Temporal Attention Transformation-Gated LSTM (STA-TGLSTM), predicts flood levels over the next three hours. In an empirical study based in Changhua County, Taiwan, STA-TGLSTM achieves a lower RMSE than the baseline machine learning models when predicting the next three hours of flood levels. Notably, the accuracy of predicting the flood sensor depth for the first hour (T+1) reaches 76% when STA-TGLSTM is applied. In comparison to traditional hydrological models, this data-driven model offers greater computational efficiency, providing future flood depth information within seconds.
Moreover, as demonstrated in the empirical study, visualizing the weight values of the spatial–temporal attention mechanism readily yields valuable insights into the relationship between each flood sensor and the spatiotemporal variables.
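To illustrate why attention weights lend themselves to this kind of inspection, the following is a minimal sketch of a generic softmax attention step, not the STA-TGLSTM mechanism itself. The sensor names, relevance scores, and feature values are hypothetical and chosen only for illustration; in a trained model the scores would be learned rather than fixed.

```python
import math


def softmax(scores):
    """Normalize raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, features):
    """Return the attention weights and the weighted sum of feature vectors."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical inputs: one relevance score and one (rainfall, depth)
# feature vector per neighboring sensor. These are assumed values.
sensor_ids = ["S1", "S2", "S3"]
scores = [2.0, 0.5, 0.5]
features = [[10.0, 0.3], [4.0, 0.1], [6.0, 0.2]]

weights, context = attend(scores, features)

# Because the weights sum to 1, they can be ranked or plotted directly
# to see which sensor contributes most to the combined representation.
ranked = sorted(zip(sensor_ids, weights), key=lambda p: p[1], reverse=True)
for sensor, weight in ranked:
    print(f"{sensor}: {weight:.3f}")
```

Since each weight is a normalized share of the total, a heat map of such weights across sensors and time steps gives a direct, if approximate, view of which inputs the model relies on.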