Abstract

<h3>Purpose/Objective(s)</h3> To test the hypothesis that deep learning (DL) techniques, using full dose distributions, can outperform machine learning (ML) methods, using dose summary statistics, in the prediction of osteoradionecrosis (ORN) resulting from head and neck cancer (HNC) radiotherapy (RT). <h3>Materials/Methods</h3> A total of 1259 subjects who received HNC RT with curative intent were identified at a single institution. All 1259 subjects were included in the ML study, and the 1236 subjects with available dose maps and mandible contours were included in the DL study. After two years of follow-up, 173 patients had developed ORN of any grade and 1086 remained ORN free (171 ORN+/1064 ORN- in the DL cohort). The ML methods, including logistic regression (LR), random forest (RF), support vector machine (SVM), principal component regression (PCR), and XGBoost, predicted ORN status from subject dose summary statistics. The DL methods, including ResNet, DenseNet, DenseNet+ResNet ensemble, and autoencoder architectures, predicted ORN status from subject 3D dose maps cropped to a bounding box around the mandible contour. The autoencoder architecture used bottleneck features with convolutional layers for prediction. The impact of training set size on DL performance was evaluated by retraining the architectures on decreasing fractions of the original training dataset (100% to 10% in 10% decrements). Model prediction performance was quantified using recall, precision, balanced accuracy, and area under the precision-recall curve (AUPRC). The ML results are the average of 10-fold stratified cross-validation with 3 repeats, whereas the DL results are from a withheld test set (650/217/369 train/validation/test case split with 111/12/48 ORN+ cases per set, respectively). Class imbalance in the DL models was handled by randomly oversampling ORN+ cases in the training set to match the number of ORN- cases. <h3>Results</h3> The table shows the ML and DL ORN prediction results. 
Decreasing the amount of training data had no impact on DL performance; even in the extreme case of training the DL models on 10% of the training data, the balanced accuracy and F1 score did not decrease. <h3>Conclusion</h3> The traditional ML models had superior performance compared to the DL models. The lack of improvement in DL performance with increasing amounts of training data suggests that significantly more data are needed for DL model construction, that low-level dose image features are not powerful for this task, or both. The poor DL performance despite a relatively large training cohort suggests that additional imaging modalities should be explored in conjunction with 3D dose maps.
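
The class-imbalance handling and the balanced accuracy metric described in the Materials/Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `oversample_minority` and `binary_metrics` are hypothetical, labels are assumed to be coded 1 for ORN+ and 0 for ORN-, and only the training split sizes reported above (650 cases, 111 ORN+) are taken from the abstract.

```python
import random


def oversample_minority(labels, seed=0):
    """Return shuffled indices in which the minority class is resampled
    with replacement until both classes have the same number of cases."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw extra minority cases at random (assumes the minority class is not empty).
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = majority + minority + extra
    rng.shuffle(indices)
    return indices


def binary_metrics(y_true, y_pred):
    """Compute recall, precision, and balanced accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "recall": recall,
        "precision": precision,
        # Balanced accuracy is the mean of recall and specificity.
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Training split sizes reported in the abstract: 650 cases, 111 of them ORN+.
train_labels = [1] * 111 + [0] * 539
balanced = oversample_minority(train_labels)
n_pos = sum(train_labels[i] for i in balanced)
print(len(balanced), n_pos)  # 1078 539
```

As a usage note, a classifier that labels every case in a test set of 369 cases with 48 ORN+ as ORN- would reach a plain accuracy of about 0.87, yet `binary_metrics([1] * 48 + [0] * 321, [0] * 369)` returns a balanced accuracy of 0.5, which is one reason balanced accuracy is a sensible choice for imbalanced cohorts like this one.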

