ABSTRACT The presence of clouds obscures Earth's surface features in optical imagery, compromising its application and usability. Identifying and removing clouds is therefore a crucial task during image preprocessing. Recently, deep learning (DL)-based cloud detection methods have shown improved performance, but capturing global semantic features and long-range dependencies requires careful selection of DL classifiers to further enhance their effectiveness. Keeping this in view, the present study proposes a novel spatial-spectral attention transformer for cloud detection (SSATR-CD) with a spatial-spectral attention module that generates an enhanced feature map, replacing convolution by operating on image patches directly. To implement the proposed approach, a new Sentinel-2 data set with various types of cloud cover over India (IndiaS2) was created and used to test the proposed method. In addition, a benchmark data set (WHUS2-CD) was considered to assess the transferability of the proposed model to other regions of the world by applying model-based transfer learning. The results highlight the effectiveness and efficiency of the SSATR-CD approach in both cases.
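
The abstract names the spatial-spectral attention module only at a high level; the block below is a minimal, hypothetical PyTorch sketch of how such a module might embed multispectral image patches without convolution and attend over both spatial (patch) and spectral (band) dimensions. The layer sizes, the assumed 13 Sentinel-2 bands, the patch size, and the fusion scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch only; the SSATR-CD architecture details are not given in the abstract.
import torch
import torch.nn as nn


class SpatialSpectralAttention(nn.Module):
    """Toy spatial-spectral attention over linearly embedded image patches (no convolution)."""

    def __init__(self, bands=13, patch=8, dim=64, heads=4):
        super().__init__()
        self.patch = patch
        # Patch embedding via a linear projection of flattened patches.
        self.embed = nn.Linear(bands * patch * patch, dim)
        # Spatial attention: tokens are patch positions.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Spectral attention: tokens are spectral bands.
        self.spectral_embed = nn.Linear(patch * patch, dim)
        self.spectral_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):  # x: (B, bands, H, W), H and W divisible by patch
        B, C, H, W = x.shape
        p = self.patch
        # Rearrange into non-overlapping patches: (B, num_patches, bands, p, p).
        patches = (x.unfold(2, p, p).unfold(3, p, p)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(B, -1, C, p, p))
        # Spatial branch: each token is one patch across all bands.
        spa = self.embed(patches.flatten(2))                    # (B, N, dim)
        spa, _ = self.spatial_attn(spa, spa, spa)
        # Spectral branch: each token is one band, averaged over patches.
        spe = self.spectral_embed(patches.mean(1).flatten(2))   # (B, bands, dim)
        spe, _ = self.spectral_attn(spe, spe, spe)
        spe = spe.mean(1, keepdim=True).expand(-1, spa.size(1), -1)
        # Fuse both branches into an enhanced per-patch feature map.
        return self.norm(self.fuse(torch.cat([spa, spe], dim=-1)))  # (B, N, dim)
```

Under these assumptions, a multispectral tensor of shape (batch, bands, height, width) yields per-patch enhanced features that a downstream transformer encoder or segmentation head could consume for cloud masking.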