Efficient road inspection and maintenance are essential to extend pavement lifespan and enhance safety. However, automated crack detection remains challenging due to varied environmental conditions and differences in image collection equipment, making robust algorithm development a critical need. Vision Transformers, with their capacity to capture long-range dependencies, offer significant advantages for crack detection in complex scenarios by effectively extracting global features. Nevertheless, existing Transformer-based methods encounter difficulties in boundary delineation due to decoder design limitations, which lead to suboptimal fusion of low-level and high-level features. To address this issue, we propose a comprehensive approach that integrates semantic preservation, detail refinement, and detail delineation. These concepts are realized through our novel Dual-Cross Attention Module (DCA) and Upsampling Attention Module (UA). The DCA module progressively filters redundant details from low-level feature layers using high-level semantic information, while preserving boundary details to refine high-level feature boundaries. In addition, the UA module employs progressive local cross-attention in upsampling, facilitating more precise boundary definitions and surpassing conventional dynamic upsampling methods. Our approach, utilizing both lightweight (MiT-B0, LVT) and middle-weight (Swin-T) backbones, demonstrates state-of-the-art performance on three diverse datasets—Crack500, CrackSC, and UAV-Crack500—highlighting its robustness across varied conditions. This work contributes to advancing Transformer-based architectures for defect segmentation in complex engineering contexts, underscoring the critical role of improved feature fusion in crack detection. The code is available at: https://github.com/SHAN-JH/DCUFormer.
Read full abstract