Abstract

Objective: Vision Transformers (ViTs) have shown promising performance in various classification tasks previously dominated by Convolutional Neural Networks (CNNs). However, the performance of ViTs in referable Diabetic Retinopathy (DR) detection is relatively underexplored. In this study, we evaluated the comparative performance of ViTs and CNNs in detecting referable DR from retinal photographs.

Design: Retrospective study.

Participants: A total of 48,269 retinal images from the open-source Kaggle DR detection dataset, the Messidor-1 dataset and the Singapore Epidemiology of Eye Diseases (SEED) study were included.

Methods: Using 41,614 retinal photographs from the Kaggle dataset, we developed five CNN models (VGG19, ResNet50, InceptionV3, DenseNet201 and EfficientNetV2S) and four ViT models (VAN_small, CrossViT_small, ViT_small and SWIN_tiny) for the detection of referable DR, defined as eyes with moderate or worse DR. The comparative performance of all nine models was evaluated on the Kaggle internal test set (1,045 study eyes) and on two external test sets, the SEED study (5,455 study eyes) and Messidor-1 (1,200 study eyes).

Main Outcome Measures: Area under the receiver operating characteristic curve (AUC), specificity and sensitivity.

Results: Among all models, the SWIN transformer achieved the highest AUC of 95.7% on the internal test set, significantly outperforming the CNN models (all P<0.001). This observation was confirmed in the external test sets, where the SWIN transformer achieved AUCs of 97.3% in SEED and 96.3% in Messidor-1. With specificity fixed at 80% on the internal test set, the SWIN transformer achieved the highest sensitivity of 94.4%, significantly better than all the CNN models (sensitivities ranging from 76.3% to 83.8%; all P<0.001). This trend was consistently observed in both external test sets.

Conclusion: Our findings demonstrate that ViTs provide superior performance over CNNs in detecting referable DR from retinal photographs. These results point to the potential of ViT models to improve and optimize retinal photograph-based deep learning for referable DR detection.
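As a concrete illustration of the model-development step, the sketch below fine-tunes a SWIN-tiny backbone for binary referable-DR classification. It is a minimal sketch assuming the timm and PyTorch libraries; the model variant, hyperparameters and data loader are illustrative placeholders, not the authors' actual training configuration.

    # Minimal fine-tuning sketch (illustrative only; not the authors' pipeline).
    import timm
    import torch
    import torch.nn as nn

    # SWIN-tiny backbone pretrained on ImageNet, with a fresh 2-class head
    # (referable DR vs. non-referable DR). Name and hyperparameters are assumptions.
    model = timm.create_model("swin_tiny_patch4_window7_224",
                              pretrained=True, num_classes=2)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def train_one_epoch(loader, device="cuda"):
        """One pass over retinal-photograph batches shaped (N, 3, 224, 224)."""
        model.to(device).train()
        for images, labels in loader:  # labels: 1 = referable DR, 0 = not
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()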

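The reported operating point, sensitivity at a fixed 80% specificity, can be read directly off the ROC curve. The sketch below (assuming scikit-learn; y_true and y_score stand in for test-set labels and predicted referable-DR probabilities) shows one way to compute both the AUC and that sensitivity.

    # Sketch of the evaluation metrics: AUC and sensitivity at fixed specificity.
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def sensitivity_at_specificity(y_true, y_score, target_specificity=0.80):
        """Sensitivity (TPR) at the operating point meeting the specificity target."""
        fpr, tpr, _ = roc_curve(y_true, y_score)
        specificity = 1.0 - fpr
        # roc_curve orders points by decreasing threshold, so specificity is
        # non-increasing; the last point still meeting the target gives the
        # highest attainable sensitivity at that specificity.
        valid = np.where(specificity >= target_specificity)[0]
        return tpr[valid[-1]]

    y_true = np.array([0, 0, 1, 1, 0, 1])                 # toy labels
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # toy probabilities
    print("AUC:", roc_auc_score(y_true, y_score))
    print("Sensitivity @ 80% specificity:",
          sensitivity_at_specificity(y_true, y_score))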