Cracks are a common form of damage in infrastructure and pose significant risks to both personal safety and property. With the development of deep learning, vision-based automatic crack detection has been widely studied. However, the task remains challenging due to factors such as complex crack topology, noisy backgrounds, and class imbalance. To address these challenges, this research proposes a novel hybrid network, named CrackNet, which leverages the strengths of both convolutional neural networks (CNNs) and transformers. On the encoder side, CNNs are employed to extract multi-level local features, while transformers are used to model global dependencies. Additionally, a strip pooling module is introduced to suppress irrelevant regions and enhance the network’s ability to segment narrow and elongated cracks. On the decoder side, an attention-based skip connection strategy and a mixed up-sampling module are implemented to restore detailed information. Furthermore, a joint learning loss combining Dice and cross-entropy with dynamic weighting is proposed to mitigate the effects of severe class imbalance. CrackNet is trained and evaluated on three public crack datasets, and experimental results show that the proposed model outperforms several well-known deep neural networks, with a particularly noticeable improvement in recall.
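
As an illustration of the joint loss idea mentioned above, the following is a minimal sketch of a Dice plus binary cross-entropy loss with an epoch-dependent weight. The abstract does not specify how the dynamic weighting is computed, so the linear schedule in `dynamic_weight` is purely an assumption, and all function names (`dice_loss`, `bce_loss`, `dynamic_weight`, `joint_loss`) are hypothetical rather than taken from the CrackNet implementation. Plain Python lists stand in for flattened per-pixel crack probabilities and ground-truth labels.

```python
import math


def dice_loss(probs, targets, eps=1.0):
    """Soft Dice loss on flattened foreground probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    # eps smooths the ratio so an empty mask does not divide by zero.
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def bce_loss(probs, targets, tiny=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for log stability."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, tiny), 1.0 - tiny)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dynamic_weight(epoch, total_epochs):
    """ASSUMED schedule (not from the paper): the Dice weight grows
    linearly from 0 at the first epoch to 1 at the last epoch."""
    if total_epochs <= 1:
        return 1.0
    return min(max(epoch / (total_epochs - 1), 0.0), 1.0)


def joint_loss(probs, targets, epoch, total_epochs):
    """Weighted sum of Dice and cross-entropy under the assumed schedule."""
    w = dynamic_weight(epoch, total_epochs)
    return w * dice_loss(probs, targets) + (1.0 - w) * bce_loss(probs, targets)


if __name__ == "__main__":
    # One crack pixel among four: a small, heavily imbalanced example.
    targets = [1, 0, 0, 0]
    probs = [0.6, 0.1, 0.2, 0.1]
    for epoch in (0, 5, 9):
        print(epoch, joint_loss(probs, targets, epoch, total_epochs=10))
```

Under this assumed schedule, the loss starts as pure cross-entropy and ends as pure Dice, the term that is less dominated by the abundant background pixels. Other weighting rules, for example ones driven by the foreground ratio of each batch, would fit the same description equally well.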