To explore a centralized approach to building test sets and assessing the performance of an artificial intelligence medical device (AIMD) intended for computer-aided diagnosis of diabetic retinopathy (DR), a framework was proposed covering data collection, data curation, and annotation. Deidentified colour fundus photographs with raw labels were collected from 11 partner hospitals. Photographs with sensitive information or authenticity issues were excluded during vetting. A team of annotators was recruited through qualification examinations and trained. The annotation process comprised three steps: initial annotation, review, and arbitration. The annotated data formed a standardized test set, which was then applied to algorithms under test (AUTs) from different developers, and the algorithm outputs were compared with the final annotation results (the reference standard). The test set consists of 6327 digital colour fundus photographs. The final labels cover five stages of DR and non-DR, as well as other ocular diseases and photographs of unacceptable quality. Fleiss' kappa among the annotators was 0.75, and Cohen's kappa between the raw labels and the final labels was 0.5. Using this test set, five AUTs were tested and compared quantitatively on accuracy, sensitivity, and specificity; the AUTs showed uneven ability to classify different types of fundus photographs. This article demonstrates a workflow for building standardized test sets and testing algorithms of AIMDs for computer-aided diagnosis of diabetic retinopathy. It may serve as a reference for developing technical standards that promote product verification and quality control and improve the comparability of products.
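As an illustration only, and not the authors' evaluation pipeline, the sketch below shows how an AUT's per-photograph outputs could be scored against reference-standard labels using the metrics named in the abstract (accuracy, per-class sensitivity and specificity, and Cohen's kappa). The label names, function name, and example data are hypothetical; Fleiss' kappa among annotators would be computed separately from the multi-annotator rating table (for example with statsmodels).

```python
# Minimal sketch (assumed setup, not the study's actual code) for comparing an
# algorithm under test (AUT) with reference-standard labels on a DR test set.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Hypothetical label set: non-DR, five DR stages, other ocular disease, ungradable.
LABELS = ["non_DR", "DR1", "DR2", "DR3", "DR4", "DR5", "other_disease", "ungradable"]

def evaluate_aut(reference, predictions, labels=LABELS):
    """Return overall accuracy, Cohen's kappa, and per-class sensitivity/specificity."""
    acc = accuracy_score(reference, predictions)
    kappa = cohen_kappa_score(reference, predictions, labels=labels)
    cm = confusion_matrix(reference, predictions, labels=labels)
    per_class = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]                      # correctly predicted as this class
        fn = cm[i, :].sum() - tp           # this class predicted as something else
        fp = cm[:, i].sum() - tp           # other classes predicted as this class
        tn = cm.sum() - tp - fn - fp       # everything else
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        per_class[label] = {"sensitivity": sensitivity, "specificity": specificity}
    return {"accuracy": acc, "cohen_kappa": kappa, "per_class": per_class}

# Toy usage with made-up labels for three photographs.
reference = ["non_DR", "DR2", "ungradable"]
predictions = ["non_DR", "DR3", "ungradable"]
print(evaluate_aut(reference, predictions))
```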