The availability of automated, accurate, and robust gross tumor volume (GTV) segmentation algorithms is critical for the management of head and neck cancer (HNC) patients. In this work, we evaluated 3 state-of-the-art deep learning algorithms combined with 8 different loss functions for PET image segmentation using a comprehensive training set, and assessed their performance on an external validation set of HNC patients. 18F-FDG PET/CT images of 470 patients presenting with HNC, with manually defined GTVs serving as the standard of reference, were used for training (340 patients), evaluation (30 patients), and testing (100 patients from different centers) of these algorithms. PET image intensity was converted to SUVs and normalized to the range 0 to 1 using the SUVmax of the whole data set. PET images were cropped to 12 × 12 × 12 cm³ subvolumes, with an isotropic voxel spacing of 3 × 3 × 3 mm³, containing the whole tumor and neighboring background including lymph nodes. We used different approaches for data augmentation, including rotation (-15° to +15°), scaling (-20% to +20%), random flipping (3 axes), and elastic deformation (sigma = 1 and proportion to deform = 0.7), to increase the number of training samples. Three state-of-the-art networks, including Dense-VNet, NN-UNet, and Res-Net, with 8 different loss functions, including Dice, generalized Wasserstein Dice loss, Dice plus XEnt loss, generalized Dice loss, cross-entropy, sensitivity-specificity, and Tversky, were used. Overall, 28 different networks were built. Standard image segmentation metrics, including Dice similarity, as well as image-derived PET metrics and first-order and shape radiomic features, were used to assess the performance of these algorithms.
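The normalization and cropping steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the function names, the zero-padding at volume borders, and the choice of a crop center are assumptions made for illustration. The only figures taken from the text are the 0 to 1 range scaled by the data-set SUVmax and the subvolume size, which at 3 mm spacing corresponds to 120 mm / 3 mm = 40 voxels per side.

```python
import numpy as np

CROP_VOXELS = 40  # 12 cm subvolume at 3 mm isotropic spacing: 120 / 3 = 40


def normalize_suv(suv, dataset_suv_max):
    """Scale SUVs to [0, 1] using the SUVmax of the whole data set."""
    return np.clip(np.asarray(suv, dtype=float) / dataset_suv_max, 0.0, 1.0)


def crop_subvolume(volume, center, size=CROP_VOXELS):
    """Crop a size^3 cube around `center` (voxel indices inside the volume).

    Assumption: regions falling outside the image are zero-padded; the
    abstract does not say how border cases were handled.
    """
    half = size // 2
    padded = np.pad(volume, half, mode="constant")
    # After padding by `half`, original index c sits at c + half, so the
    # window [c - half, c + half) starts at index c in the padded array.
    z, y, x = center
    return padded[z:z + size, y:y + size, x:x + size]


# Hypothetical usage on a synthetic volume already resampled to 3 mm voxels.
rng = np.random.default_rng(0)
pet_suv = rng.uniform(0.0, 12.0, size=(50, 60, 70))
patch = crop_subvolume(normalize_suv(pet_suv, dataset_suv_max=12.0), (25, 30, 35))
```

In practice the crop center would come from the tumor location (for example the centroid of the reference GTV during training), which is a design choice the abstract does not specify.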
The best results in terms of Dice coefficient (mean ± SD) were achieved by cross-entropy for Res-Net (0.86 ± 0.05; 95% confidence interval [CI], 0.85-0.87) and Dense-VNet (0.85 ± 0.058; 95% CI, 0.84-0.86), and by Dice plus XEnt for NN-UNet (0.87 ± 0.05; 95% CI, 0.86-0.88). The difference between the 3 networks was not statistically significant (P > 0.05). The percent relative error (RE%) of SUVmax quantification was less than 5% in networks with a Dice coefficient greater than 0.84, whereas a lower RE% (0.41%) was achieved by Res-Net with cross-entropy loss. For the maximum 3-dimensional diameter and sphericity shape features, all networks achieved an RE of ≤5% and ≤10%, respectively, reflecting small variability. Deep learning algorithms exhibited promising performance for automated GTV delineation on HNC PET images. Different loss functions performed competitively across the different networks; cross-entropy for Res-Net and Dense-VNet, and Dice plus XEnt for NN-UNet, emerged as reliable networks for GTV delineation. Caution should be exercised for clinical deployment owing to the occurrence of outliers in deep learning-based algorithms.
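For readers unfamiliar with the two headline metrics, the following sketch shows how a Dice coefficient and a percent relative error of SUVmax might be computed from a predicted and a reference mask. The function names are hypothetical, and the signed definition of RE%, (predicted - reference) / reference × 100, is an assumption, since the abstract does not state the formula.

```python
import numpy as np


def dice_coefficient(pred_mask, ref_mask):
    """Dice similarity: 2 * |A and B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, ref).sum() / denom)


def relative_error_percent(predicted, reference):
    """Signed percent relative error (assumed definition, see lead-in)."""
    return 100.0 * (predicted - reference) / reference


def suv_max(suv, mask):
    """Maximum SUV inside a binary mask."""
    return float(np.asarray(suv)[np.asarray(mask, dtype=bool)].max())


# Toy example: a 4-voxel reference mask and a prediction shifted by one voxel.
suv = np.array([1.0, 2.0, 8.0, 10.0, 3.0, 1.0])
ref = np.array([0, 1, 1, 1, 1, 0])
pred = np.array([1, 1, 1, 0, 1, 0])

dice = dice_coefficient(pred, ref)
re_suvmax = relative_error_percent(suv_max(suv, pred), suv_max(suv, ref))
```

In this toy case the prediction misses the hottest voxel, so a moderate Dice coefficient coexists with a large SUVmax error, which illustrates why the two metrics are reported separately.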