Abstract
Surface defect detection is a crucial step in ensuring the quality of industrial products. Convolutional neural networks (CNNs) based on encoder–decoder architectures have achieved great success in various defect detection tasks. However, the intrinsic locality of convolution prevents them from modeling long-range interactions explicitly, which makes it difficult to distinguish pseudo-defects in cluttered backgrounds. Transformers, by contrast, are good at learning global image dependencies but capture limited local structural information, which is needed for refined defect localization. To overcome these limitations, we combine CNN and transformer into an efficient hybrid architecture for defect detection, termed Defect Transformer (DefT), that captures local and non-local relationships collaboratively. Specifically, in the encoder module, a convolutional stem block is first adopted to retain more spatial details. Patch aggregation blocks are then used to generate a multi-scale representation with four hierarchies. Each patch aggregation block is followed by a series of DefT blocks, each of which includes a locally position-aware block for local position encoding, a lightweight multi-pooling self-attention to model multi-scale global contextual relationships with good computational efficiency, and a convolutional feed-forward network for feature transformation and further local information learning. Finally, a simple but effective decoder module is constructed to gradually recover spatial details using skip connections from the encoder. Extensive experiments on three datasets demonstrate the superiority and efficiency of our method compared with other deeper and more complex CNN- and transformer-based networks.
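To give a feel for why a multi-pooling self-attention can be computationally cheap, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes, as is common in pooling-based attention designs, that queries come from every spatial position while keys and values come from the feature map pooled to a few small grids and concatenated. The feature-map resolution, the grid sizes, and the function names are all hypothetical and do not come from the abstract.

```python
# Rough cost comparison: full self-attention vs. attention over pooled keys/values.
# Assumption: keys/values are built from pooled copies of the feature map.
# The resolution and pooled grid sizes below are hypothetical, not from the paper.


def attention_pairs(num_queries, num_keys):
    """Number of query-key similarity scores in one attention map."""
    return num_queries * num_keys


def pooled_token_count(pool_grids):
    """Total key/value tokens after pooling to each grid size and concatenating."""
    return sum(h * w for h, w in pool_grids)


height, width = 64, 64                 # assumed feature-map resolution
num_queries = height * width           # one query per spatial position

# Full self-attention: every position attends to every position.
full_cost = attention_pairs(num_queries, num_queries)

# Multi-pooling variant: every position attends to a short multi-scale summary.
pool_grids = [(1, 1), (2, 2), (4, 4), (8, 8)]   # hypothetical pooled grid sizes
num_pooled = pooled_token_count(pool_grids)
pooled_cost = attention_pairs(num_queries, num_pooled)

reduction = full_cost / pooled_cost

print(f"queries: {num_queries}, pooled keys: {num_pooled}")
print(f"full attention scores:   {full_cost}")
print(f"pooled attention scores: {pooled_cost}")
print(f"reduction factor: {reduction:.1f}x")
```

Under these assumed sizes, the attention map shrinks from quadratic in the number of positions to linear in it, multiplied by a small constant number of pooled tokens, while the mix of grid sizes still exposes context at several scales.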