ABSTRACT In optical remote sensing images, clouds exhibit irregular scales and boundaries that vary with elevation across diverse geographical locations. To accurately capture these diverse visual patterns, we propose a cloud image segmentation approach named GS-CDNet (Geographic Spatial Data-Cloud Detection Network), which integrates geospatial data with multifaceted self-attention feature extraction, multi-scale feature aggregation, and boundary clarification techniques. First, we use the geographical coordinates of optical remote sensing images to extract a raster DEM (Digital Elevation Model) from SRTM3, creating a dataset of elevation, longitude, and latitude maps that serves as geospatial data and strengthens the model's spatial positioning capability for cloud detection. Second, the cloud detection network consists of three interconnected modules: the Interleaved Self-Attention Module (ISAM) applies a variety of self-attention mechanisms in an interleaved manner to extract multi-scale feature information; the Bidirectional Multi-Scale Feature Fusion Module (BIMFM) integrates these features, enabling a more comprehensive contextual understanding; and the Boundary Extraction Module (BEM) uses a residual structure to generate a boundary cloud mask, effectively addressing the common problem of boundary blurring in multi-scale cloud masks. Finally, we compare GS-CDNet with other cloud detection methods and conduct an ablation study on its key components. Generalization experiments demonstrate the exceptional performance of the proposed model in cloud mask generation, and both the geospatial data and the individual modules contribute significantly to this performance.
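The geospatial-data step above pairs each image with an elevation map and per-pixel longitude/latitude maps. A minimal sketch of that construction is shown below, using a synthetic north-up DEM grid in plain NumPy; the 3 arc-second resolution matches SRTM3, but the origin coordinates, function names, and the nearest-pixel lookup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def lonlat_to_pixel(lon, lat, origin_lon, origin_lat, res):
    # Map geographic coordinates to (row, col) in a north-up raster whose
    # top-left corner is (origin_lon, origin_lat); nearest-pixel lookup.
    col = int(round((lon - origin_lon) / res))
    row = int(round((origin_lat - lat) / res))
    return row, col


def build_geospatial_inputs(dem, origin_lon, origin_lat, res):
    # Stack the three auxiliary channels described in the abstract:
    # elevation map, longitude map, latitude map, each of shape (H, W).
    h, w = dem.shape
    lons = origin_lon + res * np.arange(w)       # longitude grows eastward
    lats = origin_lat - res * np.arange(h)       # latitude shrinks southward
    lon_map, lat_map = np.meshgrid(lons, lats)   # both (H, W)
    return np.stack([dem, lon_map, lat_map])     # (3, H, W)


# Toy 4x4 DEM at SRTM3's 3 arc-second (1/1200 degree) resolution,
# with a hypothetical top-left corner at (100.0 E, 30.0 N).
res = 1.0 / 1200.0
dem = np.arange(16, dtype=float).reshape(4, 4)
stack = build_geospatial_inputs(dem, origin_lon=100.0, origin_lat=30.0, res=res)
```

In practice the DEM tile would be read from SRTM3 data (e.g. with a raster I/O library) and cropped to the image footprint; the stacked channels can then be concatenated with the image as extra network inputs.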