We aimed to validate the performance of autonomous diabetic retinopathy (DR) grading by comparing a human grader and a self-developed deep-learning (DL) algorithm against a gold-standard evaluation. We included 500 six-field retinal images graded by an expert ophthalmologist (gold standard) according to the International Clinical Diabetic Retinopathy Disease Severity Scale, represented as DR levels 0-4 (n = 97, 100, 100, 103 and 100, respectively). Weighted kappa was calculated to measure agreement in DR classification with the gold standard for three models: (1) a certified human grader without assistance, (2) a certified human grader with assistance from the DL algorithm, and (3) the DL algorithm operating autonomously. Using any DR (level 0 vs. 1-4) as the cutoff, we calculated sensitivity, specificity, and positive and negative predictive values (PPV and NPV). Finally, we assessed lesion discrepancies between Model 3 and the gold standard. Compared with the gold standard, weighted kappa for Models 1-3 was 0.88, 0.89 and 0.72; sensitivities were 95%, 94% and 78%; and specificities were 82%, 84% and 81%, respectively. Extrapolated to a real-world DR prevalence of 23.8%, the PPVs were 63%, 64% and 57% and the NPVs were 98%, 98% and 92%. Discrepancies between the gold standard and Model 3 were mainly incorrect detection of artefacts (n = 49), missed microaneurysms (n = 26) and inconsistencies between the segmentation and classification (n = 51). Although the autonomous DL algorithm for DR classification performed on par with a human grader for only some measures in a high-risk population, extrapolation to a real-world population demonstrated an excellent 92% NPV, which could make it clinically feasible to use the algorithm autonomously to identify patients without DR.
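
The extrapolation of PPV and NPV to a 23.8% prevalence can be illustrated with Bayes' rule. The following is a minimal sketch, not the authors' analysis code: the function name `predictive_values` and the constants `REPORTED` and `PREVALENCE` are hypothetical, and the inputs are the rounded sensitivities and specificities quoted in the abstract, so the results may differ from the reported PPV and NPV by about one percentage point.

```python
# Sketch: prevalence-adjusted predictive values from sensitivity and specificity.
# Names are illustrative assumptions; inputs are the rounded values from the abstract.

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at the given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

PREVALENCE = 0.238  # real-world DR prevalence quoted in the abstract

# (sensitivity, specificity) for each model, as reported (rounded)
REPORTED = {
    "Model 1 (human grader)": (0.95, 0.82),
    "Model 2 (human grader + DL)": (0.94, 0.84),
    "Model 3 (autonomous DL)": (0.78, 0.81),
}

for name, (sens, spec) in REPORTED.items():
    ppv, npv = predictive_values(sens, spec, PREVALENCE)
    print(f"{name}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Because most patients in a population with 23.8% prevalence have no DR, even the autonomous model's moderate sensitivity leaves few false negatives among those graded negative, which is why its NPV stays high while its PPV is modest.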