Abstract

Computer vision based visual inspection systems have gained enormous importance for manufacturing quality control in recent years due to the advent of convolutional neural networks (CNNs) and transformer-based vision models. CNN-based models extract global features by gradually increasing the receptive field, but long-range dependencies are ignored; as a result, CNNs tend to recognize objects based on texture rather than shape. Transformer models, on the other hand, model long-range dependencies using the self-attention mechanism, but their ability to learn spatial information within each patch is limited, which means they can disregard significant local spatial patterns such as texture. In this work, we propose to combine transformer-based and CNN-based models to take advantage of the strengths of both approaches. To meet the inference-time constraints of real-time defect classification tasks, we apply knowledge distillation (KD), using the softened logits of the ensemble model as supervision to train a lightweight CNN model (ResNet18). The study showed that the proposed vision-transformer-based KD approach overcomes the limitations of restricted computational resources and can be deployed on low-power, resource-limited devices. The experimental results also showed that the proposed framework outperforms stand-alone CNN methods in terms of mean accuracy on the test datasets.
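To make the distillation step concrete, the sketch below shows one possible PyTorch formulation of training a ResNet18 student against the softened logits of a CNN + vision-transformer ensemble teacher. The temperature, loss weighting, ensemble averaging, and `NUM_CLASSES` value are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of KD with softened ensemble logits (assumed PyTorch setup).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

NUM_CLASSES = 6  # assumed number of defect classes; dataset-specific

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Combine the softened-logit KD term with the standard cross-entropy loss."""
    # KL divergence between temperature-softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Lightweight student as named in the abstract; teacher_cnn and teacher_vit are
# stand-ins for the pretrained CNN and vision-transformer ensemble members.
student = resnet18(num_classes=NUM_CLASSES)

def train_step(images, labels, teacher_cnn, teacher_vit, optimizer):
    with torch.no_grad():
        # Average the ensemble logits to form the soft supervision signal.
        teacher_logits = (teacher_cnn(images) + teacher_vit(images)) / 2.0
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```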
