Remote sensing is increasingly used for water body detection, with significant value for environmental monitoring, hydrological research, and disaster early warning. However, existing methods face four challenges in multi-scene and multi-temporal water body detection: water bodies vary widely in shape and size, which complicates detection; complex land cover types readily cause false positives and missed detections; high-resolution images are costly to acquire, which limits long-term applications; and multi-temporal data are not handled effectively, which makes the dynamic changes of water bodies difficult to capture. To address these challenges, this study proposes TSAE-UNet, a novel network for multi-scene and multi-temporal water body detection based on spatiotemporal feature extraction. TSAE-UNet integrates convolutional neural networks (CNNs), depthwise separable convolutions, ConvLSTM, and attention mechanisms; by capturing multi-scale features and establishing long-term dependencies, it significantly improves the accuracy and robustness of water body detection. The Otsu method was used to rapidly process Sentinel-1A and Sentinel-2 images and generate a high-quality training dataset. In the first experiment, five rectangular areas of approximately 37.5 km² each were selected to validate the detection performance of TSAE-UNet across different scenes. The second experiment focused on Jining City, Shandong Province, China, and analyzed monthly water body changes from 2020 to 2022 and quarterly changes in 2022. The experimental results demonstrate that TSAE-UNet excels in multi-scene and long-term water body detection, achieving a precision of 0.989, a recall of 0.983, an F1 score of 0.986, and an IoU of 0.974, significantly outperforming FCN, PSPNet, DeepLabV3+, ADCNN, and MECNet.
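For readers unfamiliar with the two building blocks named above, the following is a minimal sketch of Otsu thresholding and of the four reported pixel-wise metrics. It is an illustration under stated assumptions, not the authors' pipeline: the function names `otsu_threshold` and `segmentation_metrics`, the 256-bin histogram, and the convention that water pixels lie above the threshold are all hypothetical choices, and the input is assumed to be a single-band water index or backscatter image already loaded as a NumPy array.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Return the threshold that maximizes the between-class variance.

    `bins=256` is an illustrative default, not a value taken from the paper.
    """
    hist, edges = np.histogram(np.asarray(values, dtype=float).ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    weights = hist / hist.sum()

    w0 = np.cumsum(weights)              # probability of the lower class
    w1 = 1.0 - w0                        # probability of the upper class
    cum_mean = np.cumsum(weights * centers)
    total_mean = cum_mean[-1]

    # Between-class variance: (mu_T * w0 - mu_k)^2 / (w0 * w1).
    denom = w0 * w1
    valid = denom > 1e-12                # skip splits that leave a class empty
    sigma_b = np.zeros_like(denom)
    sigma_b[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / denom[valid]

    k = int(np.argmax(sigma_b))
    return float(edges[k + 1])           # upper edge of the last lower-class bin


def _safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return float(num) / float(den) if den else 0.0


def segmentation_metrics(pred, truth):
    """Pixel-wise precision, recall, F1 and IoU for two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))       # water predicted as water
    fp = int(np.sum(pred & ~truth))      # non-water predicted as water
    fn = int(np.sum(~pred & truth))      # water that was missed

    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    iou = _safe_div(tp, tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```

As a usage sketch, a mask could be obtained with `mask = index > otsu_threshold(index)` and scored with `segmentation_metrics(mask, reference)`, where `index` and `reference` are hypothetical arrays. The reported scores are also internally consistent: the harmonic mean of a precision of 0.989 and a recall of 0.983 rounds to the stated F1 score of 0.986.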