Vision Transformer Outperforms Deep Convolutional Neural Network-based Model in Classifying X-ray Images

Om Uparkar, Jyoti Bharti, R.K Pateriya, Rajeev Kumar Gupta, Ashutosh Sharma

doi:10.1016/j.procs.2023.01.209

Abstract

For a decade, Convolutional Neural Networks (CNNs) have been the standard approach to automated clinical image diagnosis. Vision Transformers (ViTs) are newer in this domain and achieve performance comparable to CNNs, making them a competitive alternative. This paper proposes an off-the-shelf ViT-based approach to detecting lung diseases. The approach is compared with a CNN-based hybrid deep learning model, Visual Geometric Group Data Spatial Transformer with CNN (VDSNet), which outperforms various existing deep learning techniques. The experimental results are computed on the open-source NIH chest X-rays dataset from Kaggle. In this study, we observe that pre-trained vision transformers outperform the CNN-based VDSNet on several metrics, on the full dataset as well as on different subsets of it. Vision transformers also show an increase in accuracy with the addition of internal layers and a reduction in patch size, at the expense of slightly higher training time, making them a potential alternative to convolutional neural networks.
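The trade-off between patch size and training time noted in the abstract can be illustrated with simple arithmetic. The sketch below is not taken from the paper: it assumes a square 224 x 224 input (a common ViT default, not stated here) and one extra class token, and the helper names `num_patches` and `attention_pairs` are hypothetical. It only shows that halving the patch size quadruples the number of patches, which in turn raises the cost of self-attention.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(image_size: int, patch_size: int) -> int:
    """Token pairs scored by one self-attention head in one layer.

    Assumes one extra class token is prepended to the patch sequence.
    """
    tokens = num_patches(image_size, patch_size) + 1
    return tokens * tokens


if __name__ == "__main__":
    # Assumed input resolution; the abstract does not state one.
    image_size = 224
    for patch_size in (32, 16, 8):
        print(
            patch_size,
            num_patches(image_size, patch_size),
            attention_pairs(image_size, patch_size),
        )
```

Under these assumptions, smaller patches give the model a finer view of the X-ray but lengthen the token sequence, which is consistent with the higher training time reported for smaller patch sizes.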

Full Text