Abstract

Convolutional neural networks (CNNs) have been the dominant deep learning approach to automated medical image diagnosis for a decade. Recently, vision transformers (ViTs) have emerged as a competitive alternative to CNNs in computer vision, reaching similar levels of performance while possessing several interesting properties that could prove beneficial for explaining deep neural networks. Since most medical images are grayscale, three-dimensional (3-D) scans (e.g., CT and MRI) that differ substantially from natural images, we investigate whether it is worthwhile to move to transformer-based models or whether we should keep working with CNNs for 3-D medical image classification, and, if so, what the advantages and drawbacks of switching to ViTs for medical image diagnosis are. We examine these questions in a series of experiments on three 3-D medical image datasets. Our findings show that, while CNNs perform better when trained from scratch, ViTs benefit strongly from ImageNet pre-training and outperform their CNN counterparts on the larger datasets when combined with self-supervised learning and the sharpness-aware minimization (SAM) optimizer.
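
For readers unfamiliar with sharpness-aware minimization, the sketch below illustrates the two-pass update that SAM performs on top of a base optimizer: a gradient-ascent perturbation of the weights followed by a descent step using the gradient at the perturbed point. It is a minimal, generic PyTorch illustration; the model, loss, data, and the `rho` value are assumptions for the example and do not reflect the paper's exact training configuration.

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One sharpness-aware minimization (SAM) update (illustrative sketch)."""
    # First forward/backward pass: gradients at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()

    # Ascent step: perturb weights by rho * g / ||g|| to find a "sharp" neighbor.
    with torch.no_grad():
        grad_norm = torch.norm(
            torch.stack([p.grad.norm(p=2) for p in model.parameters() if p.grad is not None]),
            p=2,
        )
        scale = rho / (grad_norm + 1e-12)
        perturbations = []
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e_w = p.grad * scale
            p.add_(e_w)              # w <- w + e(w)
            perturbations.append(e_w)
    model.zero_grad()

    # Second forward/backward pass: gradients evaluated at the perturbed weights.
    loss_fn(model(x), y).backward()

    # Undo the perturbation, then let the base optimizer apply the SAM gradient.
    with torch.no_grad():
        for p, e_w in zip(model.parameters(), perturbations):
            if e_w is not None:
                p.sub_(e_w)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

In practice this doubles the per-step cost (two forward/backward passes), which is the usual trade-off accepted when using SAM to improve generalization.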
