This study compared the performance of the convolutional neural network (CNN) with that of the vision transformer (ViT) and the gated multilayer perceptron (gMLP) in the classification of radiographic images of dental structures. Retrospectively collected 2-dimensional images derived from cone beam computed tomographic volumes were used to train CNN, ViT and gMLP architectures as classifiers for 4 different cases. The cases selected for training were the classification of (1) the radiographic appearance of the maxillary sinuses, (2) the maxillary and mandibular incisors, (3) the presence or absence of the mental foramen and (4) the positional relationship of the mandibular third molar to the inferior alveolar nerve canal. The performance metrics (sensitivity, specificity, precision, accuracy and F1-score) and the areas under the receiver operating characteristic and precision-recall curves (AUC) were calculated. The ViT, with an accuracy of 0.74-0.98, performed on par with the CNN (accuracy 0.71-0.99) in all tasks, and for certain tasks it outperformed the CNN. The gMLP displayed marginally lower performance (accuracy 0.65-0.98) than the CNN and the ViT. The AUCs ranged from 0.77-1.00 (CNN), 0.80-1.00 (ViT) and 0.73-1.00 (gMLP) across the 4 cases. The differences in performance between the ViT, the gMLP and the CNN (the current state of the art) were significant in certain tasks. These task-dependent differences suggest that the capabilities of different architectures may be leveraged. The vision transformer and, to a lesser extent, the gated multilayer perceptron are deep learning models that exhibit performance comparable to that of the convolutional neural network in the classification of dental radiographic images.
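For readers unfamiliar with the reported metrics, the following is a minimal sketch of how sensitivity, specificity, precision, accuracy and F1-score follow from the four cells of a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study; the authors' own evaluation code is not described here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts.

    tp, fp, tn, fn: counts of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = safe_div(tp, tp + fn)   # recall / true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)     # positive predictive value
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Illustrative counts only (not results from the study).
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For multi-class tasks, these quantities are typically computed per class (one class versus the rest) and then averaged; which averaging scheme the study used is not stated in the abstract.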