Evaluating Clinical Acceptability of Organs-at-Risk Segmentation in Head & Neck Cancer (HNC) by Open-Source 3D Convolutional Neural Networks (CNNs)

J. Marsilla, J.W. Kim, S. Kim, D. Tkachuk, A.J. Hope, B. Haibe-Kains

doi:10.1016/j.ijrobp.2021.07.485

Abstract

<h3>Purpose/Objective(s)</h3> To evaluate the performance and clinical acceptability of the 3D convolutional neural network models with the highest objective performance among those developed to contour regions of interest (ROIs) in medical images. <h3>Materials/Methods</h3> Radiation treatment planning data from 582 HNC patients treated at a large tertiary cancer center were used for this study. Eleven open-source 3D segmentation models (6 U-Nets, 3 DenseNets, 1 3D pyramid abstraction network, and 1 3D recurrent slice-wise attention network), originally engineered for medical image segmentation, were trained to segment 19 OARs. The same 80/10/10 split was used for training, validation, and testing, corresponding to 479, 44, and 59 scans, respectively. Models were ranked by volumetric Dice similarity coefficient (DSC) and Hausdorff distance (HD). The best-performing model was selected, re-trained on full-resolution patient scans, and validated using sliding-window inference. Using contours from the best model, two radiation oncologists assessed clinical acceptability on a 5-point scale, from 5 ("anatomically perfect, not requiring editing") to 1 ("anatomically incorrect, unusable for planning purpose"). At least 10 cases were randomly selected, and deep learning-based contours (DLCs) and manual contours (MCs), covering the entire range of the OAR, were presented to the blinded observers. Mean ratings for DLCs and MCs were compared using the Wilcoxon signed-rank test and Student's t-test; a significant difference between the groups indicated that the observers could distinguish MCs from DLCs. <h3>Results</h3> A simple 3D U-Net performed best among the 11 models when comparing average DSC (0.75 ± 0.12) and mean HD (0.44 ± 0.09) across all OARs. Mandible (0.90 ± 0.03), brainstem (0.89 ± 0.04), and spinal cord (0.85 ± 0.05) received the highest mean DSC, and chiasm (0.38 ± 0.19), optic nerves (0.65 ± 0.15), and lips (0.70 ± 0.09) the lowest mean DSC. 
Lens (0.052 ± 0.019), optic nerve (0.069 ± 0.026), and chiasm (0.075 ± 0.02) received the lowest mean HD, and spinal cord (1.15 ± 0.15), mandible (1.11 ± 0.18), and brachial plexus (0.90 ± 0.12) the highest mean HD. For clinical acceptability, MC received a significantly higher rating than DLC (3.75 ± 0.77 vs. 3.23 ± 0.86) when all OARs were considered (<i>P</i> < 0.01). When evaluating OARs individually, MC showed significantly higher ratings for brainstem, esophagus, larynx, eyes, optic nerves, while lips, whereas parotids, acoustics, and lenses were indistinguishable. <h3>Conclusion</h3> Simple 3D architectures consistently outcompete more complex networks by quantitative measures. Qualitative assessment for clinical acceptability may not agree with quantitative performance, especially when the entire range of OARs is evaluated.
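The abstract ranks models by volumetric DSC and HD but does not say how either metric was implemented or in what units HD is reported. As a rough illustration only, the sketch below computes both metrics for a pair of binary 3D masks with NumPy. The function names, the `spacing` parameter, and the convention of returning a DSC of 1.0 for two empty masks are assumptions made for this example and are not taken from the study.

```python
import numpy as np


def dice_coefficient(pred, ref):
    """Volumetric Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


def hausdorff_distance(pred, ref, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance between the foreground voxels of two masks.

    `spacing` (hypothetical parameter) scales voxel indices to physical units.
    Brute-force pairwise distances: fine for small masks, memory-hungry for
    full-resolution scans.
    """
    p = np.argwhere(np.asarray(pred, dtype=bool)) * np.asarray(spacing)
    r = np.argwhere(np.asarray(ref, dtype=bool)) * np.asarray(spacing)
    if len(p) == 0 or len(r) == 0:
        raise ValueError("Hausdorff distance is undefined for an empty mask")
    # d[i, j] = Euclidean distance from pred voxel i to ref voxel j.
    d = np.linalg.norm(p[:, None, :] - r[None, :, :], axis=-1)
    # Largest nearest-neighbour distance, taken in both directions.
    return max(d.min(axis=1).max(), d.min(axis=0).max())


# Toy example: a 4x4x4 cube and the same cube shifted by one voxel.
ref_mask = np.zeros((8, 8, 8), dtype=bool)
ref_mask[2:6, 2:6, 2:6] = True
pred_mask = np.zeros((8, 8, 8), dtype=bool)
pred_mask[3:7, 2:6, 2:6] = True

print(dice_coefficient(pred_mask, ref_mask))    # 0.75
print(hausdorff_distance(pred_mask, ref_mask))  # 1.0
```

The toy example also shows why the two metrics can rank structures differently: DSC is an overlap ratio and penalizes small structures heavily for small shifts, while HD is an absolute distance that tends to grow with the extent of the structure. This is consistent with the chiasm having both the lowest DSC and one of the lowest HD values above.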

Full Text