Crack detection methods based on deep learning models such as convolutional neural networks (CNNs) and the more recently developed vision transformers (ViTs) are proliferating. However, comparative evaluations of these models for real-time crack detection are still lacking. In this paper, a total of 14 lightweight deep learning models, comprising seven CNN models, five ViT models and two hybrid models, are trained to build deep learning-based crack detection methods. Comprehensive experiments covering accuracy, inference time, robustness and transfer learning are conducted on the publicly available DeepCrack dataset to compare the effectiveness and real-time performance of the models. In terms of accuracy metrics and robustness, the ViT model using the SegFormer segmentation method with MiT-B1 as the backbone performs best, while in terms of inference time, the ViT models using the TopFormer segmentation method are the fastest. When both accuracy and inference time are considered, TopFormer with the small version of its backbone network offers relatively better real-time performance, while the ViT model using the SegFormer segmentation method with MiT-B0 as the backbone and the CNN model using the fully convolutional network (FCN) segmentation method with HRNetV2-W18-Small as the backbone achieve higher mean intersection over union (mIoU) values on computers and mobile devices, respectively. We also find that pre-training on a dataset more relevant to the target application scenario, rather than on the widely used ImageNet, gives better results for deep learning models. This study provides a reference for engineers choosing among lightweight deep learning models.
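
For readers unfamiliar with the mIoU metric named above, the following is a minimal sketch of how it is commonly computed for a two-class (background vs. crack) segmentation. It is an illustration only: the function name `mean_iou`, the flattened-label input format and the choice to skip classes absent from both prediction and ground truth are assumptions, not the evaluation code used in the study.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int = 2) -> float:
    """Mean intersection over union across classes.

    `pred` and `target` are flattened per-pixel class labels
    (assumed here: 0 = background, 1 = crack).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        # Assumption: a class absent from both masks is skipped, not scored.
        if union > 0:
            ious.append(intersection / union)

    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six pixels, two of them misclassified.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target)  # IoU is 2/4 for each class
```

In this toy example each class has an intersection of 2 pixels and a union of 4, so both per-class IoU values, and therefore the mean, are 0.5. Whether counts are accumulated per image or over the whole dataset changes the result and would need to match the protocol of the original experiments.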