Performance of a Breast Cancer Detection AI Algorithm Using the Personal Performance in Mammographic Screening Scheme.

Yan Chen, Iain T Darker, Jonathan J James, Adnan G Taib

doi:10.1148/radiol.223299

Abstract

Background The Personal Performance in Mammographic Screening (PERFORMS) scheme is used to assess reader performance. Whether this scheme can assess the performance of artificial intelligence (AI) algorithms is unknown.
Purpose To compare the performance of human readers and a commercially available AI algorithm interpreting PERFORMS test sets.
Materials and Methods In this retrospective study, two PERFORMS test sets, each consisting of 60 challenging cases, were evaluated by human readers between May 2018 and March 2021 and by an AI algorithm in 2022. AI considered each breast separately, assigning a suspicion of malignancy score to features detected. Performance was assessed using the highest score per breast. Performance metrics, including sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), were calculated for AI and humans. The study was powered to detect a medium-sized effect (odds ratio, 3.5 or 0.29) for sensitivity.
Results A total of 552 human readers interpreted both PERFORMS test sets, consisting of 161 normal breasts, 70 malignant breasts, and nine benign breasts. No difference was observed at the breast level between the AUC for AI and the AUC for human readers (0.93 and 0.88, respectively; P = .15). When using the developer's suggested recall score threshold, no difference was observed for AI versus human reader sensitivity (84% and 90%, respectively; P = .34), but the specificity of AI was higher (89%) than that of the human readers (76%; P = .003). However, it was not possible to demonstrate equivalence due to the size of the test sets. When using recall thresholds to match mean human reader performance (90% sensitivity, 76% specificity), AI showed no differences in performance, with a sensitivity of 91% (P = .73) and a specificity of 77% (P = .85).
Conclusion Diagnostic performance of AI was comparable with that of the average human reader when evaluating cases from two enriched test sets from the PERFORMS scheme.

© RSNA, 2023

See also the editorial by Philpotts in this issue.
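
As a rough illustration of the breast-level scoring described in Materials and Methods, the Python sketch below takes the highest suspicion score per breast and then computes sensitivity and specificity at a recall threshold. It is not the developer's algorithm or the study's analysis code. The function names, breast identifiers, toy scores, and the threshold of 0.5 are all hypothetical, and assigning a score of 0 to a breast with no detected features is an assumption. The abstract does not say how benign breasts were labeled, so the toy labels only distinguish malignant from not malignant.

```python
from typing import Dict, List, Tuple


def breast_level_scores(feature_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Collapse per-feature suspicion scores to one score per breast (the maximum).

    Assumption: a breast with no detected features is scored 0.0.
    """
    return {breast: max(scores, default=0.0) for breast, scores in feature_scores.items()}


def sensitivity_specificity(
    scores: Dict[str, float], malignant: Dict[str, bool], threshold: float
) -> Tuple[float, float]:
    """Recall a breast when its score is at or above the threshold, then
    compare recall decisions against ground truth."""
    tp = fn = tn = fp = 0
    for breast, score in scores.items():
        recalled = score >= threshold
        if malignant[breast]:
            if recalled:
                tp += 1
            else:
                fn += 1
        else:
            if recalled:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data, not taken from the study.
features = {
    "case1_L": [0.2, 0.9],
    "case1_R": [],
    "case2_L": [0.4],
    "case2_R": [0.7, 0.3],
    "case3_L": [0.1],
}
truth = {
    "case1_L": True,
    "case1_R": False,
    "case2_L": True,
    "case2_R": False,
    "case3_L": False,
}

scores = breast_level_scores(features)
sens, spec = sensitivity_specificity(scores, truth, threshold=0.5)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising or lowering the threshold trades sensitivity against specificity, which is the idea behind choosing recall thresholds to match mean human reader performance.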
