Deep Learning for Contour Quality Assurance on RTOG 0933

D. Mumaw, E. Porter, C.C. Vu, P. Fuentes, I.M. Sala, N.K. Myziuk, Z.A. Siddiqui, T.M. Guerrero

doi:10.1016/j.ijrobp.2022.07.941

Abstract

<h3>Purpose/Objective(s)</h3> To evaluate a CT-based deep learning (DL) hippocampal segmentation model, trained on a single-institution dataset, on the RTOG 0933 dataset and to explore its potential for multi-institutional contour quality assurance (QA). <h3>Materials/Methods</h3> An attention-gated 3D ResNet DL model was trained for semantic segmentation of the left (L) and right (R) hippocampus on a 390-patient single-institution Gamma Knife cohort, using contours from institutional observers (IOs) as ground truth. The model was then evaluated on the RTOG 0933 dataset by comparing its contours with both the treating physician (TP) contours and blinded IO contours via Dice coefficient and Hausdorff distance (HD). The sensitivity and specificity of the DL model in capturing discrepancies of the TP contour relative to the IO contour (a surrogate for central review contours) were assessed. Hippocampal-avoidance whole-brain radiotherapy plans were generated. The ability of DL and IO to identify unacceptable deviations of TP plans (per RTOG 0933-defined constraints) was assessed via the Wilcoxon signed-rank (WSR) test and Cochran's Q test. <h3>Results</h3> The DL model showed significantly greater agreement with IO contours than with TP contours (DL:IO L/R Dice 73%/74%, HD 4.86/4.74 mm; DL:TP L/R Dice 62%/65%, HD 7.23/6.94 mm; all p<0.001). Using the RTOG protocol-defined passing metric of HD<7 mm as an agreement threshold, the DL model achieved L/R AUCs of 0.80/0.79 for discriminating TP contours from IO contours, with false-negative rates of 17.2%/20.5%. The WSR test showed that, among subjects meeting the HD<7 mm agreement threshold, DL and IO selected populations that were not dosimetrically different from TP. Among subjects failing HD<7 mm, DL and IO selected populations that differed significantly from TP in hippocampal maximum dose (WSR=18.0, p=0.001; WSR=7.0, p=0.002) and PTV D98% (WSR=61.5, p=0.033; WSR=15, p=0.002).
Cochran's Q showed no statistically significant difference between DL and IO in the rate of identifying RTOG-defined acceptable contours from TP (34.33, p=0.311). <h3>Conclusion</h3> Our study demonstrates the feasibility of using a DL model trained at a single institution to perform contour QA on a multi-institutional trial for the task of hippocampal segmentation. The DL model was capable of discriminating contours generated by treating physicians from those of a central reviewer and identified a population dosimetrically comparable to the one identified by the central reviewer. Further study is needed to assess optimal quality metrics and the generalizability of DL for contour QA.
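
The two agreement metrics named above, and the HD<7 mm passing threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of a contour as a set of voxel indices, the toy masks, and the voxel spacings are all assumptions made for illustration. The sketch also assumes the maximum (not percentile) Hausdorff distance, computed over all voxels of each mask rather than over surface points only, which the abstract does not specify.

```python
import math
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]
Spacing = Tuple[float, float, float]


def dice(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Dice coefficient 2*|A and B| / (|A| + |B|) for two voxel sets."""
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def _to_mm(voxels: Set[Voxel], spacing: Spacing) -> List[Tuple[float, ...]]:
    """Scale voxel indices to physical coordinates in millimetres."""
    return [tuple(i * s for i, s in zip(v, spacing)) for v in voxels]


def hausdorff_mm(a: Set[Voxel], b: Set[Voxel],
                 spacing: Spacing = (1.0, 1.0, 1.0)) -> float:
    """Symmetric (maximum) Hausdorff distance in mm between two voxel sets."""
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty mask")
    pa, pb = _to_mm(a, spacing), _to_mm(b, spacing)

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(pa, pb), directed(pb, pa))


def passes_hd_check(a: Set[Voxel], b: Set[Voxel], spacing: Spacing,
                    threshold_mm: float = 7.0) -> bool:
    """True when the two contours agree under the HD < 7 mm criterion."""
    return hausdorff_mm(a, b, spacing) < threshold_mm


# Toy data (hypothetical): two 4x4x4 blocks offset by two voxels along x.
tp_mask = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
dl_mask = {(x + 2, y, z) for (x, y, z) in tp_mask}

print(dice(tp_mask, dl_mask))                              # 0.5
print(hausdorff_mm(tp_mask, dl_mask))                      # 2.0
print(passes_hd_check(tp_mask, dl_mask, (4.0, 1.0, 1.0)))  # False (HD = 8 mm)
```

The brute-force nearest-point search is quadratic in the number of voxels, so it suits only small structures or toy data like this; for full-resolution masks an array library with a distance transform or spatial index would be the usual choice.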

Full Text