Extreme precipitation events pose significant risks to human life and property, making their accurate prediction an essential focus of current research. Recent studies have concentrated primarily on the formation mechanisms of extreme precipitation. Existing prediction methods, however, do not adequately account for the combined effects of terrain and atmospheric conditions, which limits their forecasting accuracy. In addition, the satellite data used in prior studies lack the resolution needed to capture abrupt changes in extreme precipitation. To address these shortcomings, this study introduces the multimodal attention ConvLSTM-GAN for extreme rainfall nowcasting (ER-MACG). The model takes high-resolution Fengyun-4A (FY4A) satellite precipitation products, together with terrain and atmospheric datasets, as inputs. ER-MACG extends the ConvLSTM-GAN framework by adding an attention module to the generator, which sharpens the focus on critical areas and time steps. This design alleviates the information loss that occurs in the spatiotemporal convolutional long short-term memory (ConvLSTM) network and, compared with the standard ConvLSTM-GAN model, better captures the fine-scale temporal and spatial changes in extreme precipitation events, yielding more refined predictions. The main findings are as follows: (a) The ER-MACG model achieved significantly greater predictive accuracy and overall performance than existing approaches. (b) Considering digital elevation model (DEM) and layered precipitable water (LPW) data alone did not significantly enhance the ability to predict extreme precipitation events in Zhejiang Province. (c) The ER-MACG model significantly improved the identification and prediction of extreme precipitation events across different intensity levels.