Weakly supervised crack segmentation aims to produce pixel-level crack masks with minimal human annotation, where labels typically only distinguish crack patches from crack-free patches. This task is crucial for assessing structural integrity and safety in real-world industrial applications, where manually labeling crack locations at the pixel level is labor-intensive and often impractical. To address the challenge of label uncertainty, this paper presents CrackCLIP, a novel approach that leverages language prompts to augment semantic context and employs the Contrastive Language–Image Pre-Training (CLIP) model to enhance weakly supervised crack segmentation. First, a gradient-based class activation map is used to generate coarse pixel-level pseudo-labels from a trained crack patch classifier. These coarse pseudo-labels are then used to fine-tune lightweight linear adapters, which are integrated into the frozen CLIP image encoder to adapt the model to the specialized task of crack segmentation. In addition, textual prompts tailored to crack characteristics are fed into the frozen CLIP text encoder to extract features that encapsulate the semantic essence of cracks. The final crack segmentation is obtained by comparing the similarity between the text prompt features and the visual patch token features. Comparative experiments on the Crack500, CFD, and DeepCrack datasets show that the proposed framework outperforms existing weakly supervised crack segmentation methods. The results also indicate that the pre-trained vision-language model has strong potential for crack feature learning, improving the overall performance and generalization capability of the proposed framework.
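To make the final segmentation step concrete, the sketch below illustrates one plausible way to turn CLIP-style features into a crack probability map: adapted visual patch tokens are compared against text embeddings for a background prompt and a crack prompt, and the per-patch similarities are upsampled to pixel resolution. This is a minimal illustration, not the authors' implementation; the `LinearAdapter` module, the `crack_probability_map` function, the two-prompt setup, and the logit scale of 100 are all assumptions for demonstration, and the random tensors stand in for outputs of the frozen CLIP encoders.

```python
import torch
import torch.nn.functional as F


class LinearAdapter(torch.nn.Module):
    """Assumed adapter: a single linear layer applied to frozen CLIP patch tokens."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(tokens)


def crack_probability_map(patch_tokens, text_feats, adapter, grid_hw, image_hw):
    """patch_tokens: (B, N, D) visual patch tokens from a frozen CLIP image encoder.
    text_feats:   (2, D) text features for [background prompt, crack prompt].
    Returns a (B, H, W) crack probability map at image resolution."""
    tokens = F.normalize(adapter(patch_tokens), dim=-1)   # adapt and L2-normalize visual tokens
    text = F.normalize(text_feats, dim=-1)                # L2-normalize text features
    logits = 100.0 * tokens @ text.t()                    # (B, N, 2) scaled cosine similarities
    probs = logits.softmax(dim=-1)[..., 1]                # per-patch probability of the crack prompt
    h, w = grid_hw
    probs = probs.reshape(-1, 1, h, w)                    # reshape token scores back to the patch grid
    return F.interpolate(
        probs, size=image_hw, mode="bilinear", align_corners=False
    ).squeeze(1)                                          # upsample to pixel-level mask


if __name__ == "__main__":
    B, N, D = 1, 14 * 14, 512                             # e.g. a ViT-B/16 grid on a 224x224 crop
    patch_tokens = torch.randn(B, N, D)                   # stand-in for frozen image-encoder tokens
    text_feats = torch.randn(2, D)                        # stand-in for encoded background/crack prompts
    adapter = LinearAdapter(D)                            # would be fine-tuned on coarse pseudo-labels
    mask = crack_probability_map(patch_tokens, text_feats, adapter, (14, 14), (224, 224))
    print(mask.shape)                                     # torch.Size([1, 224, 224])
```

In a full pipeline, the adapter would be the only trainable component, optimized against the CAM-derived pseudo-labels while both CLIP encoders stay frozen.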