ABSTRACT Cloud cover is a significant factor limiting the effectiveness of satellite-based Earth observation. Existing cloud detection algorithms rely primarily on imagery from satellite sensors in the visible to near-infrared spectral range, making day-and-night cloud monitoring difficult. Convolutional neural networks have shown outstanding performance in previous cloud detection algorithms owing to their strong ability to extract local information; however, their inherent inductive bias limits their capacity to learn long-range semantic information. To address these challenges, we propose SwinCloud, a U-shaped semantic segmentation network based on an enhanced Swin Transformer for cloud detection in the thermal infrared spectral range. Specifically, we augment the Swin Transformer's window attention module with a parallel CNN-based pathway to effectively model both global and local information, and we insert a feature fusion module before the final upsampling module of the decoder to better integrate low-level spatial information with high-level semantic information. On the Landsat-8 cloud detection dataset, SwinCloud outperforms state-of-the-art methods. When transferred to the SDGSAT-TIS cloud detection dataset, it achieves an mIoU of 69.9%, demonstrating the strong transferability of SwinCloud across different sensors. We also apply SwinCloud to cloud detection in the visible bands of Landsat-8; the results demonstrate SwinCloud's generalization capability across different spectral bands.