Abstract

Purpose/Objective(s)

Segmentation of head and neck (HN) organs at risk (OARs) is a laborious process. Here we introduce and validate a newly developed deep-learning-based auto-segmentation program and compare it with a commercially available system trained at another institution and with the same commercial system trained on our internal data.

Materials/Methods

A total of 864 previously treated HN cancer patients were available to train and evaluate a prototype deep-learning-based normal-tissue 3D auto-segmentation algorithm. The algorithm is based on a fully convolutional network that combines U-Net and V-Net features as its backbone. Models were trained with a Dice loss function and the Adam optimizer, using 150-500 patients per model. The OARs were delineated by a single experienced physician (gold data). A subset of 75 cases was withheld from training and used for validation. For those cases, we generated new OAR sets with three different deep-learning models and compared them to the gold data: A) the prototype model trained with the gold data, B) a commercial software package trained with the gold data (n = 213), and C) the same commercial software with a model trained at another institution (n = 589). Agreement between the gold data and the auto-segmented structures was evaluated with the Dice similarity coefficient (DSC) and a voxel-penalty metric that penalizes each missing or extra voxel as a function of its distance from the gold-standard contour, with distances below a forgiveness threshold incurring no penalty. An ANOVA test with post hoc pairwise analysis was performed to assess differences in these metrics. The auto-segmented contours were also qualitatively evaluated by the physician on a scale of 0-5.

Results

The average DSC and voxel-penalty scores for algorithms A, B, and C across all OARs in the 75 evaluation cases were 0.80/77.68, 0.74/62.75, and 0.66/45.26, respectively. The difference in mean DSC was statistically significant (p < 0.05) for all 11 OARs for which data from all three algorithms were available. The A/B difference was significant for 6 OARs. Algorithm A scored the highest DSC and voxel-penalty score for all OARs except the pharyngeal constrictors, and all OARs except the pharyngeal constrictors achieved DSC ≥ 0.7 with algorithm A. For three structures, the mean DSC differed significantly between the same algorithm trained at different institutions (B/C). In the qualitative evaluation by a blinded expert, 51 structures (20.2%) from model A were clinically acceptable without edits. The percentage of 'clinically useful' scores was largest for model A (95.2%), followed by model B (88.0%) and model C (80.6%).

Conclusion

The prototype algorithm outperformed the commercial algorithm, even when the commercial algorithm was trained on data from the same institution. Auto-segmentation results can differ significantly when the same algorithm is trained on data from different institutions.
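To make the training objective concrete, the following is a minimal sketch of a soft Dice loss paired with the Adam optimizer, as named in the abstract. It is written in PyTorch for illustration only; the tensor layout, smoothing constant, and learning rate are assumptions, not details from the study.

import torch

def soft_dice_loss(pred, target, eps=1e-6):
    # pred: (N, C, D, H, W) class probabilities; target: one-hot mask of the same shape.
    spatial = tuple(range(2, pred.ndim))
    intersection = (pred * target).sum(dim=spatial)
    denom = pred.sum(dim=spatial) + target.sum(dim=spatial)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice.mean()  # average over batch and OAR classes

# Adam optimizer as stated in the abstract; the learning rate is an assumed placeholder.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)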
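The two agreement metrics can be sketched similarly. The DSC is the standard 2|A∩B| / (|A| + |B|); the voxel-penalty function below is one plausible reading of the abstract's description, charging each missing or extra voxel its distance to the gold-standard contour beyond a forgiveness threshold. The linear penalty form, the threshold value, and the absence of whatever normalization produces scores such as 77.68 are all assumptions.

import numpy as np
from scipy.ndimage import distance_transform_edt

def dsc(a, b):
    # Dice similarity coefficient between two binary masks.
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def voxel_penalty(gold, auto, spacing_mm, tau_mm=2.0):
    # Penalize each disagreeing voxel by its distance (mm) to the gold surface,
    # forgiving anything closer than tau_mm. Unnormalized sketch.
    gold, auto = np.asarray(gold, bool), np.asarray(auto, bool)
    dist_outside = distance_transform_edt(~gold, sampling=spacing_mm)  # to gold, from outside
    dist_inside = distance_transform_edt(gold, sampling=spacing_mm)    # to surface, from inside
    extra = auto & ~gold      # auto-segmented voxels absent from the gold contour
    missing = gold & ~auto    # gold voxels the model failed to cover
    d = np.concatenate([dist_outside[extra], dist_inside[missing]])
    return np.clip(d - tau_mm, 0.0, None).sum()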
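Finally, the statistical comparison (one-way ANOVA with post hoc pairwise analysis) could be run as below. The abstract does not name the post hoc procedure, so the Bonferroni-corrected paired t-tests, chosen because all three models were scored on the same 75 cases, are an assumption.

from itertools import combinations
from scipy import stats

def compare_models(dsc_by_model):
    # dsc_by_model: {"A": [...], "B": [...], "C": [...]} per-case DSC for one OAR.
    names, samples = zip(*dsc_by_model.items())
    f_stat, p_anova = stats.f_oneway(*samples)  # omnibus test across the three models
    results = {"anova_p": p_anova}
    pairs = list(combinations(range(len(names)), 2))
    for i, j in pairs:
        _, p = stats.ttest_rel(samples[i], samples[j])  # paired: same evaluation cases
        results[f"{names[i]} vs {names[j]}"] = min(p * len(pairs), 1.0)  # Bonferroni
    return results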
