Abstract

Artificial intelligence as an approach to visual inspection in industrial applications has been considered for decades. Recent successes, driven by advances in deep learning, represent a potential paradigm shift and could enable automated visual inspection even under complex environmental conditions. Convolutional neural networks (CNNs) have been the de facto standard in deep-learning-based computer vision (CV) for the last decade. Recently, attention-based vision transformer architectures have emerged and surpassed the performance of CNNs on benchmark datasets for standard CV tasks, such as image classification, object detection, and segmentation. Nevertheless, despite these outstanding results, the application of vision transformers to real-world visual inspection remains sparse, likely due to the assumption that they require enormous amounts of data to be effective. In this study, we evaluate this assumption by systematically comparing seven widely used state-of-the-art CNN- and transformer-based architectures trained on three different use cases in the domain of visual damage assessment for railway freight car maintenance. We show that vision transformer models achieve at least equivalent performance to CNNs in industrial applications with sparse data available, and significantly surpass them on increasingly complex tasks.
