Detecting cracks in building structures is essential for ensuring safety, extending service life, and preserving the economic value of the built environment. Machine learning (ML) and deep learning (DL) techniques have previously been used to improve crack classification accuracy. However, conventional convolutional neural network (CNN) methods incur high computational costs owing to their large number of trainable parameters, and they tend to extract only high-dimensional shallow features that may not comprehensively represent crack characteristics. To address these issues, we propose a novel convolution and composite attention transformer network (CCTNet). CCTNet enhances crack identification by processing more input pixels and by combining convolution channel attention with window-based self-attention. This dual approach aims to combine the localized feature extraction of CNNs with the global contextual understanding afforded by self-attention. Additionally, we apply an improved cross-attention module within CCTNet to increase the interaction and integration of features across adjacent windows. CCTNet achieves a precision of 98.60%, 98.93%, and 99.33% on the Historical Building Crack2019, SDTNET2018, and proposed DS3 datasets, respectively. Furthermore, the training and validation losses of the proposed model are close to zero, and the area under the curve (AUC) is 0.99 and 0.98 for Historical Building Crack2019 and SDTNET2018, respectively. These results indicate that CCTNet outperforms existing methods and offers accurate, efficient, and reliable detection of cracks in building structures.
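To make the idea of pairing channel attention with window-based self-attention concrete, the following is a minimal NumPy sketch, not the CCTNet implementation. Everything in it is an assumption made for illustration: the function names, the squeeze-and-excitation style gate standing in for convolution channel attention (no convolution layers are included), the random weights, the window size, and the residual sum used to merge the two branches. The cross-attention module across adjacent windows is not modeled.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style gate on a feature map x of shape (H, W, C).
    Assumption: a stand-in for the paper's convolution channel attention."""
    squeezed = x.mean(axis=(0, 1))                # global average pool -> (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)       # ReLU bottleneck -> (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gate in (0, 1) -> (C,)
    return x * gate                               # rescale each channel

def window_partition(x, ws):
    """Split (H, W, C) into non-overlapping ws x ws windows: (num_windows, ws*ws, C)."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: (num_windows, ws*ws, C) back to (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def window_self_attention(x, ws, wq, wk, wv):
    """Single-head scaled dot-product attention computed inside each window."""
    H, W, C = x.shape
    win = window_partition(x, ws)                 # (nW, ws*ws, C)
    q, k, v = win @ wq, win @ wk, win @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores, axis=-1) @ v            # (nW, ws*ws, C)
    return window_reverse(out, ws, H, W)

def composite_block(x, ws, params):
    """Hypothetical merge: residual sum of the channel and window branches."""
    channel_branch = channel_attention(x, params["w1"], params["w2"])
    window_branch = window_self_attention(
        x, ws, params["wq"], params["wk"], params["wv"]
    )
    return x + channel_branch + window_branch

# Toy usage with random weights (illustrative values only).
rng = np.random.default_rng(0)
H, W, C, ws = 8, 8, 4, 4
x = rng.standard_normal((H, W, C))
params = {
    "w1": rng.standard_normal((C, C // 2)),
    "w2": rng.standard_normal((C // 2, C)),
    "wq": rng.standard_normal((C, C)),
    "wk": rng.standard_normal((C, C)),
    "wv": rng.standard_normal((C, C)),
}
y = composite_block(x, ws, params)
print(y.shape)  # (8, 8, 4)
```

In this sketch the channel gate reweights whole feature channels using a global statistic, while the window branch mixes spatial positions only within each window. Letting information flow between neighboring windows is the role the abstract assigns to the cross-attention module, which is left out here.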