Evaluating the sensitivity of deep learning to inter-reader variations in lesion delineations on bi-parametric MRI in identifying clinically significant prostate cancer

Ansh Roge, Rakesh Shiradkar, Amogh Hiremath, Sadhna Verma, Ryan Ward, Halimat Olaniyan, Leonardo Kayat Bittencourt, Michael Sobota, Justin Ream, Anant Madabhushi, Andrei Purysko, Sree Harsha Tirumani, Ravi K Samala, Khan M Iftekharuddin, Maciej A Mazurowski

doi:10.1117/12.2613245

Abstract

Deep learning-based convolutional neural networks (CNNs) for prostate cancer (PCa) risk stratification rely on radiologist-delineated regions of interest (ROIs) on MRI. These ROIs reflect the reader’s interpretation of the region of PCa. Variations in reader annotations change the features extracted from the ROIs, which may in turn affect the classification performance of CNNs. In this study, we analyzed the effect of inter-reader variations in PCa ROI delineations on the training of CNNs for distinguishing clinically significant (csPCa) from clinically insignificant PCa (ciPCa). We used 180 patient studies (n=274 lesions) from 3 cohorts in which patients underwent 3T multi-parametric MRI followed by MRI-targeted biopsy and/or radical prostatectomy. ISUP Gleason grade groups (GGG) obtained from pathology were used to define csPCa (GGG≥2) and ciPCa (GGG=1). Five radiologists, each with over 5 years of experience in prostate imaging, delineated PCa ROIs on bi-parametric MRI (bpMRI, comprising T2-weighted (T2W) and diffusion-weighted (DWI) sequences) within the training set (D1, n1=160 lesions) using targeted biopsy locations. Patches extracted from these ROIs were used to train individual CNNs (N1-N5), one per reader, using the SqueezeNet architecture. The average volume of the reader-delineated ROIs used for training varied greatly, ranging between 1106 and 2107 mm³ across readers. The resulting networks showed no significant difference in classification performance (AUC = 0.82 ± 0.02), indicating that they were relatively robust to inter-reader variations in ROIs. These models were then evaluated on independent test sets (D2, n2=85 lesions; D3, n3=29 lesions) in which ROIs were obtained by co-registration of MRI with post-surgical pathology and were therefore unaffected by inter-reader variations. Network performance on D2 and D3 was 0.80 ± 0.02 and 0.62 ± 0.03, respectively.
The CNN predictions were moderately consistent, with ICC(2,1) scores of 0.74 on D2 and 0.54 on D3. Higher agreement in ROI overlap was associated with higher correlation in predictions on the external test sets (R = 0.89, p < 0.05). Furthermore, higher average ROI volume was associated with higher AUC on D3, suggesting that more comprehensive ROIs may provide more features for deep learning networks to use in classification. Inter-reader variations in ROIs on MRI may therefore influence the reliability and generalizability of CNNs trained for PCa risk stratification.
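
For readers unfamiliar with ICC(2,1), the sketch below shows one common way to compute it: the two-way random-effects, absolute-agreement, single-rater form of Shrout and Fleiss. It is not the authors' code. The function name `icc_2_1` and the layout of the input (one row per lesion, one column per network N1-N5) are assumptions made for illustration, and the example matrix is the classic six-subject, four-rater table from Shrout and Fleiss (1979), not data from this study.

```python
import numpy as np


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: array of shape (n_targets, n_raters). In the setting of the
    abstract, each row would be a lesion and each column the predicted
    probability from one reader-specific network (an assumed layout).
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)  # per-target means
    col_means = x.mean(axis=0)  # per-rater means

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Illustrative ratings only (Shrout & Fleiss, 1979), not study data.
example = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
])
icc_example = icc_2_1(example)
print(round(icc_example, 2))  # about 0.29 for this table
```

Because ICC(2,1) measures absolute agreement, a systematic offset between networks (for example, one network that is consistently more confident than the others) lowers the score even when the networks rank lesions identically.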

Full Text