Abstract

Several semantic segmentation models have been proposed in recent years to screen for glaucoma in retinal fundus images, but the lack of comprehensive benchmark datasets makes it difficult to analyze their generalization ability. This paper makes two main contributions. First, it proposes a transformer-based model with two segmentation heads for segmenting the optic cup and disc in retinal fundus images. The model comes in two architectures, based on the monolithic vanilla vision transformer (ViT) and the multiscale Swin transformer, and is trained on both the original images and their versions cropped around the optic disc. During inference, the test image is fed into the first head for initial cropping, and the resulting cropped image is then fed into the second head to generate a refined segmentation map. The architecture is trainable end-to-end, requires no pre- or post-processing, and yields promising results compared to state-of-the-art methods. Second, the models are evaluated on a new benchmark problem consisting of eight retinal datasets using a leave-one-out evaluation protocol to assess their generalization capability. The monolithic ViT-based model yields average dice scores of 93.55% and 85.33% for the cup and disc, respectively, and a cup-to-disc ratio of 0.054 over eight scenarios, while the hierarchical Swin-based model provides average dice scores of 94.45% and 85.31% for the cup and disc, respectively, and a cup-to-disc ratio of 0.055. The method demo is available at: https://huggingface.co/spaces/BigData-KSU/Glaucoma-Detection.
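To make the two-stage inference pipeline described above concrete, the sketch below shows one plausible crop-then-refine flow in PyTorch. It is a minimal illustration, not the authors' implementation: the class `TransformerSegHead`, the helper `crop_around_disc`, the fixed crop margin, and the three-class (background/disc/cup) label convention are all assumptions made for the example.

```python
# Minimal sketch of the two-stage (coarse crop -> refined segmentation) inference.
# TransformerSegHead and crop_around_disc are illustrative stand-ins, not the paper's code.
import torch
import torch.nn as nn

class TransformerSegHead(nn.Module):
    """Placeholder for a ViT/Swin encoder with a segmentation decoder.

    Outputs per-pixel logits for 3 assumed classes: background, optic disc, optic cup.
    """
    def __init__(self, num_classes=3):
        super().__init__()
        # Stand-in layer for the transformer backbone + decoder.
        self.net = nn.Conv2d(3, num_classes, kernel_size=1)

    def forward(self, x):
        return self.net(x)

def crop_around_disc(image, disc_mask, margin=32):
    """Crop the image to a bounding box around the predicted optic disc
    (assumption: a simple box with a fixed pixel margin feeds the second stage)."""
    ys, xs = torch.where(disc_mask)
    if ys.numel() == 0:                      # no disc found: fall back to the full image
        return image
    h, w = image.shape[-2:]
    y0, y1 = max(0, ys.min().item() - margin), min(h, ys.max().item() + margin)
    x0, x1 = max(0, xs.min().item() - margin), min(w, xs.max().item() + margin)
    return image[..., y0:y1, x0:x1]

coarse_head = TransformerSegHead()           # first head: trained on full fundus images
refine_head = TransformerSegHead()           # second head: trained on disc-centred crops

image = torch.rand(1, 3, 512, 512)           # dummy fundus image
with torch.no_grad():
    coarse_logits = coarse_head(image)
    disc_mask = coarse_logits.argmax(dim=1)[0] > 0   # classes 1/2 = disc/cup region
    crop = crop_around_disc(image, disc_mask)
    refined_logits = refine_head(crop)               # refined cup/disc segmentation map
```

In practice the crop coordinates would also be retained so the refined map can be pasted back into the full-image frame before computing dice or the cup-to-disc ratio.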
