A weakly-supervised transformer-based hybrid network with multi-attention for pavement crack detection

Zhenlin Wang, Zhufei Leng, Zhixin Zhang

doi:10.1016/j.conbuildmat.2023.134134

Abstract

Crack detection is of great importance for the maintenance of infrastructure, and roads are among the most crucial kinds of infrastructure in China. Road safety accidents, which are mainly caused by cracks, have a significant influence on people's property, personal safety and the economic development of society. Thus, it is essential to identify pavement defects accurately and repair them promptly in order to prolong the lifespan of the road, minimize maintenance expenses, prevent further deterioration and reduce the occurrence of hazards. In recent years, deep neural networks have achieved considerable success in crack detection, yielding substantial savings in manpower, time and money compared with conventional approaches. Nevertheless, owing to numerous difficulties, including time-consuming pixel-level annotation, inadequate information acquisition, discontinuous cracks and low-quality images, the detection of pavement defects remains a great challenge with several open issues. To this end, we propose a novel weakly-supervised hybrid network with multi-attention, termed CGTr-Net, for pavement crack detection. To alleviate information loss and to extract both local and global features well, we design the backbone CG-Trans. It combines a Convolutional Neural Network (CNN), which excels at extracting local features but has difficulty capturing global representations, with a Gated Axial Transformer, whose gated position-sensitive axial attention mechanism efficiently extracts long-distance feature dependencies but is weaker at capturing local feature details. To enhance feature fusion between the Transformer layer and the convolution layer, a feature fusion module (TCFF) is added to the network. The two feature maps obtained from the Transformer and the CNN are used to generate Grad-CAM. Subsequently, we use a Conditional Random Field (CRF) to further refine the Grad-CAM and adapt Affinity from Attention (AFA), which learns semantic affinity from the Gated Axial Transformer and the CNN, to produce more accurate pseudo labels. The proposed CGTr-Net is evaluated on two different crack segmentation datasets, where it achieves the highest Recall (Re), F-score (F1) and mean intersection-over-union (mIoU), surpassing all competitors in the experiments. These results demonstrate the robustness, effectiveness and superiority of CGTr-Net compared with existing state-of-the-art methods.
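As an illustration of the three reported metrics, the following minimal sketch computes Recall, F1 and mIoU for a single binary crack mask. It is not the authors' evaluation code. The function names (`confusion_counts`, `crack_metrics`) are hypothetical, and the sketch assumes that mIoU is the unweighted mean of the crack-class and background-class IoU, a common convention that the abstract itself does not specify.

```python
from typing import Dict, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2-D binary mask: 1 = crack pixel, 0 = background


def confusion_counts(pred: Mask, target: Mask) -> Tuple[int, int, int, int]:
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def crack_metrics(pred: Mask, target: Mask) -> Dict[str, float]:
    """Recall, F1 and mIoU for one predicted mask against its ground truth."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    # Assumption: mIoU averages the IoU of the crack and background classes.
    iou_crack = safe_div(tp, tp + fp + fn)
    iou_background = safe_div(tn, tn + fp + fn)
    miou = (iou_crack + iou_background) / 2
    return {"recall": recall, "f1": f1, "miou": miou}


if __name__ == "__main__":
    predicted = [[1, 1, 0], [0, 0, 0]]
    ground_truth = [[1, 0, 0], [1, 0, 0]]
    print(crack_metrics(predicted, ground_truth))
```

Averaging the IoU over both classes keeps the score from being dominated entirely by the background, which typically covers most pixels in a pavement image; a dataset-level score would accumulate the four counts over all images before dividing.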

Full Text