Deep learning with uncertainty estimation for automatic tumor segmentation in PET/CT of head and neck cancers: impact of model complexity, image processing and augmentation

Abstract

Objective. Target volumes for radiotherapy are usually contoured manually, which can be time-consuming and prone to inter- and intra-observer variability. Automatic contouring by convolutional neural networks (CNN) can be fast and consistent but may produce unrealistic contours or miss relevant structures. We evaluate approaches for increasing the quality and assessing the uncertainty of CNN-generated contours of head and neck cancers with PET/CT as input. Approach. Two patient cohorts with head and neck squamous cell carcinoma and baseline 18F-fluorodeoxyglucose positron emission tomography and computed tomography images (FDG-PET/CT) were collected retrospectively from two centers. The union of manual contours of the gross primary tumor and involved nodes was used to train CNN models for generating automatic contours. The impact of image preprocessing, image augmentation, transfer learning and CNN complexity, architecture, and dimension (2D or 3D) on model performance and generalizability across centers was evaluated. A Monte Carlo dropout technique was used to quantify and visualize the uncertainty of the automatic contours. Main results. CNN models provided contours with good overlap with the manually contoured ground truth (median Dice Similarity Coefficient: 0.75–0.77), consistent with reported inter-observer variations and previous auto-contouring studies. Image augmentation and model dimension, rather than model complexity, architecture, or advanced image preprocessing, had the largest impact on model performance and cross-center generalizability. Transfer learning on a limited number of patients from a separate center increased model generalizability without decreasing model performance on the original training cohort. High model uncertainty was associated with false positive and false negative voxels as well as low Dice coefficients. Significance. High quality automatic contours can be obtained using deep learning architectures that are not overly complex. Uncertainty estimation of the predicted contours shows potential for highlighting regions of the contour requiring manual revision or flagging segmentations requiring manual inspection and intervention.
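
The Monte Carlo dropout technique referred to above keeps dropout layers active at inference time and runs the network several times on the same PET/CT input, so that the voxel-wise spread of the predicted probabilities can be read as an uncertainty map. Below is a minimal PyTorch sketch of that idea only; the toy 3D network, dropout rate and number of forward passes are illustrative assumptions, not the architectures or settings used in the paper.

```python
# Minimal sketch of Monte Carlo dropout uncertainty for a segmentation CNN.
# ToyNet is a stand-in model; the paper's actual 2D/3D architectures differ.
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    def __init__(self, in_channels=2):  # e.g. PET and CT as two input channels
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Dropout3d(p=0.2),                      # dropout reused at test time
            nn.Conv3d(16, 1, 1),                      # voxel-wise tumor logit
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))

def mc_dropout_predict(model, volume, n_samples=20):
    """Run repeated stochastic forward passes with dropout active."""
    model.train()                        # keeps dropout "on"; no gradients are used
    with torch.no_grad():
        samples = torch.stack([model(volume) for _ in range(n_samples)])
    mean_prob = samples.mean(dim=0)      # average tumor probability per voxel
    uncertainty = samples.std(dim=0)     # high std = uncertain voxel
    return mean_prob, uncertainty

if __name__ == "__main__":
    model = ToyNet()
    petct = torch.randn(1, 2, 32, 64, 64)   # (batch, channels, z, y, x)
    prob, unc = mc_dropout_predict(model, petct)
    contour = prob > 0.5                    # binary automatic contour
    print(contour.shape, float(unc.max()))
```

Voxels with a high standard deviation across the sampled predictions are the ones that would be highlighted for manual review or used to flag a segmentation for inspection.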

Similar Papers
  • Research Article
  • Cited by 219
  • 10.1016/j.ultrasmedbio.2003.12.001
Watershed segmentation for breast tumor in 2-D sonography
  • May 1, 2004
  • Ultrasound in Medicine & Biology
  • Yu-Len Huang + 1 more


  • Research Article
  • Cited by 4
  • 10.1186/s13014-022-01982-y
Deep learning tools for the cancer clinic: an open-source framework with head and neck contour validation
  • Feb 8, 2022
  • Radiation Oncology (London, England)
  • John C Asbach + 3 more

Background: With the rapid growth of deep learning research for medical applications comes the need for clinical personnel to be comfortable and familiar with these techniques. Taking a proven approach, we developed a straightforward open-source framework for producing automatic contours for head and neck planning computed tomography studies using a convolutional neural network (CNN). Methods: Anonymized studies of 229 patients treated at our clinic for head and neck cancer from 2014 to 2018 were used to train and validate the network. We trained a separate CNN iteration for each of 11 common organs at risk, and then used data from 19 patients previously set aside as test cases for evaluation. We used a commercial atlas-based automatic contouring tool as a comparative benchmark on these test cases to ensure acceptable CNN performance. For the CNN contours and the atlas-based contours, performance was measured using three quantitative metrics and physician reviews, with a survey and quantifiable correction time for each contour. Results: The CNN achieved statistically better scores than the atlas-based workflow on the quantitative metrics for 7 of the 11 organs at risk. In the physician review, the CNN contours were more likely to need minor corrections but less likely to need substantial corrections, and the cumulative correction time required was less than for the atlas-based contours for all but two test cases. Conclusions: With this validation, we packaged the code framework, the trained CNN parameters, and a no-code, browser-based interface to facilitate reproducibility and expansion of the work. All scripts and files are available in a public GitHub repository and are ready for immediate use under the MIT license. Our work introduces a deep learning tool for automatic contouring that is easy for novice personnel to use.

  • Research Article
  • Cited by 5
  • 10.1016/j.meddos.2020.09.004
Assessing tumor centrality in lung stereotactic ablative body radiotherapy (SABR): the effects of variations in bronchial tree delineation and potential for automated methods
  • Oct 13, 2020
  • Medical Dosimetry
  • Wsam Ghandourh + 6 more


  • Research Article
  • Cited by 21
  • 10.1088/1361-6560/ace307
Dosimetric comparison of autocontouring techniques for online adaptive proton therapy
  • Aug 11, 2023
  • Physics in Medicine & Biology
  • A Smolders + 7 more

Objective. Anatomical and daily set-up uncertainties impede high precision delivery of proton therapy. With online adaptation, the daily plan is reoptimized on an image taken shortly before the treatment, reducing these uncertainties and, hence, allowing a more accurate delivery. This reoptimization requires target and organs-at-risk (OAR) contours on the daily image, which need to be delineated automatically since manual contouring is too slow. Whereas multiple methods for autocontouring exist, none of them are fully accurate, which affects the daily dose. This work aims to quantify the magnitude of this dosimetric effect for four contouring techniques. Approach. Plans reoptimized on automatic contours are compared with plans reoptimized on manual contours. The methods include rigid and deformable registration (DIR), deep-learning based segmentation and patient-specific segmentation. Main results. It was found that independently of the contouring method, the dosimetric influence of using automatic OAR contours is small (<5% prescribed dose in most cases), with DIR yielding the best results. Contrarily, the dosimetric effect of using the automatic target contour was larger (>5% prescribed dose in most cases), indicating that manual verification of that contour remains necessary. However, when compared to non-adaptive therapy, the dose differences caused by automatically contouring the target were small and target coverage was improved, especially for DIR. Significance. The results show that manual adjustment of OARs is rarely necessary and that several autocontouring techniques are directly usable. Contrarily, manual adjustment of the target is important. This allows prioritizing tasks during time-critical online adaptive proton therapy and therefore supports its further clinical implementation.

  • Research Article
  • 10.1093/bjr/tqag036
Automatic segmentation of clinical target volume for radiation therapy in breast-conserving patients and exploration of clinical factors influential to its performance.
  • Mar 10, 2026
  • The British journal of radiology
  • Maochen Zhang + 10 more

To develop and validate a deep learning model for whole breast clinical target volume (CTV) contouring and evaluate clinical features affecting its performance. Five datasets with 857 patients from a single center were used. Dataset 1 (n = 300) trained and tested the model. Dataset 2 (n = 10) evaluated contouring time and dosimetric parameters. Datasets 3 (n = 20) and 4 (n = 10) were for clinical evaluation. Dataset 5 (n = 517) identified clinical factors influencing auto-contouring accuracy. Model performance was assessed using Dice Similarity Coefficient (DSC) and 95th percentile Hausdorff Distance (HD95). The median DSC and HD95 for left- and right-sided models in Dataset 1 were 0.941, 1.75 mm and 0.937, 2.47 mm, respectively. In Dataset 2, both auto-contouring and auto-contouring with manual corrections were significantly faster than manual contouring (P = 0.005 for both), while still achieving clinically acceptable dosimetric results. In Dataset 3, two physicians rated automatic and manual contours as equivalent (P = 0.214, P = 0.075), while the other rated auto-contouring higher (P < 0.001). In Dataset 4, the auto-contouring model outperformed 1/5 physicians by DSC (P = 0.009) and 3/5 by HD95 (P = 0.015, P = 0.007, P = 0.017). In Dataset 5, peripheral tumor-bed and low-density breast tissue were associated with lower DSC (P < 0.001 for both) and higher HD95 (P < 0.001 for both). Cases without unfavorable factors performed better than those with (P < 0.001 for both). The proposed model demonstrated acceptable accuracy, consistency, and efficiency in breast CTV contouring. Peripheral tumor-bed and low-density breast tissue reduced auto-contouring performance. The characteristics of challenging cases in whole breast CTV auto-contouring should be identified.
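
The two reported metrics, the Dice Similarity Coefficient (DSC) and the 95th-percentile Hausdorff Distance (HD95), can both be computed from a pair of binary masks. The NumPy/SciPy sketch below only illustrates the definitions under the assumption of 3D boolean arrays and a known voxel spacing; it does not reproduce the study's evaluation pipeline.

```python
# Illustrative Dice and 95th-percentile Hausdorff distance between binary masks.
import numpy as np
from scipy import ndimage

def dice(pred, ref):
    """Dice Similarity Coefficient: 2*|A & B| / (|A| + |B|)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

def hd95(pred, ref, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance (HD95), in mm."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    # Surface voxels = mask minus its erosion.
    surf_p = pred & ~ndimage.binary_erosion(pred)
    surf_r = ref & ~ndimage.binary_erosion(ref)
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_r = ndimage.distance_transform_edt(~surf_r, sampling=spacing)
    dist_to_p = ndimage.distance_transform_edt(~surf_p, sampling=spacing)
    dists = np.concatenate([dist_to_r[surf_p], dist_to_p[surf_r]])
    return np.percentile(dists, 95)

if __name__ == "__main__":
    ref = np.zeros((40, 40, 40), dtype=bool)
    ref[10:30, 10:30, 10:30] = True
    pred = np.roll(ref, 2, axis=0)           # slightly shifted prediction
    print(f"DSC  = {dice(pred, ref):.3f}")
    print(f"HD95 = {hd95(pred, ref, spacing=(2.0, 1.0, 1.0)):.1f} mm")
```

HD95 is generally preferred over the maximum Hausdorff distance because a single outlying voxel no longer dominates the result.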

  • Research Article
  • Cited by 7
  • 10.3934/mbe.2021371
A deep learning based automatic segmentation approach for anatomical structures in intensity modulation radiotherapy.
  • Jan 1, 2021
  • Mathematical Biosciences and Engineering
  • Han Zhou + 5 more

To evaluate an automatic segmentation approach for organs at risk (OARs) and compare dose-volume histogram (DVH) parameters in radiotherapy. Thirty-three patients were selected and their OARs were contoured with a U-Net-based automatic segmentation approach, applied to nasopharyngeal carcinoma (NPC), breast, and rectal cancer cases, respectively. The automatic contours were transferred to the Pinnacle system to evaluate contour accuracy and compare DVH parameters. Manual contouring of the OARs took 56.5 ± 9, 23.12 ± 4.23 and 45.23 ± 2.39 min for NPC, breast and rectal cancer, respectively, compared with 1.5 ± 0.23, 1.45 ± 0.78 and 1.8 ± 0.56 min for automatic contouring. For NPC, the eye had the best Dice similarity coefficient (DSC, 0.907 ± 0.02) and the spinal cord the poorest (0.459 ± 0.112); for breast, the lung had the best DSC (0.944 ± 0.03) and the spinal cord the poorest (0.709 ± 0.1); for rectal cancer, the bladder had the best DSC (0.91 ± 0.04) and the femoral heads the poorest (0.43 ± 0.1). The poor spinal cord results in the head and neck cases were attributed to the division of the medulla oblongata, and the unexpectedly low femoral head DSC was attributed to the manual contours used as the reference. The deep learning-based automatic contouring approach is sufficiently accurate for research purposes. However, the DSC does not fully reflect the accuracy of the dose distribution, since changes in OAR volume and DSC can cause dose changes. Given the significant time savings and good performance for some OARs, automatic contouring can also play a supervisory role.

  • Research Article
  • Cited by 52
  • 10.1016/j.applthermaleng.2021.116849
Deep learning strategies for critical heat flux detection in pool boiling
  • Mar 13, 2021
  • Applied Thermal Engineering
  • Seyed Moein Rassoulinejad-Mousavi + 7 more


  • Research Article
  • Cited by 3
  • 10.5114/jcb.2022.112814
Automatic contouring using deformable image registration for tandem-ring or tandem-ovoid brachytherapy.
  • Jan 1, 2022
  • Journal of Contemporary Brachytherapy
  • Yagiz Yedekci + 3 more

Purpose: To investigate the effectiveness of deformable image registration (DIR)-based automatic contouring for tandem-ring (T-R) or tandem-ovoid (T-O) 3-dimensional computed tomography (CT)-based image-guided brachytherapy (IGBT). Material and methods: CT images of 28 patients with intact cervical cancer were retrospectively analyzed. The selected group had T-R or T-O insertion for IGBT. Hybrid DIR was performed between the first-fraction CT and subsequent CTs for IGBT. The first IGBT CT images served as reference images, and all DIRs were performed based on these first IGBT CT scans. Contour similarity between manual and automatic segmentations was evaluated with the Dice similarity coefficient (DSC), and mean volumes of the manually and automatically delineated structures were compared. Finally, dosimetric comparisons were performed to determine how contour differences affect the doses to the target and organs at risk (OARs). Results: In general, mean volumes of the automatic contours were larger than those of the manual contours for both T-R and T-O insertions. However, the difference in volume was statistically significant for the small bowel only (p < 0.05 and p < 0.01 for T-R and T-O, respectively). DSC scores were low for the small bowel and the sigmoid in both applicator sets. When the two applicator sets were compared, DIR-based contour propagation for the rectum performed worse in T-O than in T-R applications. Dosimetric comparisons showed that volume differences between the manual and propagated contours did not affect dose-volume parameters, and treatment plans based on manually contoured targets also covered the DIR contours well. The average time for DIR was 2.0 ± 0.1 minutes per fraction compared with 14.0 ± 0.4 minutes for manual contouring (p < 0.001). Conclusions: DIR-based automatic contouring of the structures appears successful for both the T-R and T-O applications in cervical IGBT. DIR significantly decreased the contouring time. Our results indicate that automatic contouring in IGBT is safe and time-saving.

  • Abstract
  • Cited by 1
  • 10.1016/j.ijrobp.2021.07.554
Automatic Contouring Using Deformable Image Registration for Tandem-Ring or Tandem-Ovoid Brachytherapy
  • Oct 22, 2021
  • International Journal of Radiation Oncology*Biology*Physics
  • F.Y Yedekci + 3 more


  • Research Article
  • 10.1118/1.4924220
SU‐E‐J‐134: Optimizing Technical Parameters for Using Atlas Based Automatic Segmentation for Evaluation of Contour Accuracy: Experience with Cardiac Structures From NRG Oncology/RTOG 0617
  • Jun 1, 2015
  • Medical Physics
  • J Yu + 12 more

Purpose: Accurate contour delineation is crucial for radiotherapy. Atlas-based automatic segmentation (ABAS) tools can be used to increase the efficiency of contour accuracy evaluation. This study aims to optimize technical parameters utilized in the tool by exploring the impact of library size and atlas number on the accuracy of cardiac contour evaluation. Methods: Patient CT DICOMs from RTOG 0617 were used for this study. Five experienced physicians delineated the cardiac structures, including pericardium, atria and ventricles, following an atlas guideline. The consistency of cardiac structure delineation using the atlas guideline was verified in a study with four observers and seventeen patients. The CT and cardiac structure DICOM files were then used for the ABAS technique. To study the impact of library size (LS) and atlas number (AN) on automatic contour accuracy, automatic contours were generated with varied technique parameters for five randomly selected patients. Three LS values (20, 60, and 100) were studied using commercially available software, with the AN set to four as recommended by the manufacturer. Using the manual contour as the gold standard, the Dice similarity coefficient (DSC) was calculated between the manual and automatic contours, and five-patient averaged DSCs were compared for each cardiac structure. To study the impact of AN, the LS was set to 100 and the AN was varied from one to five, again computing five-patient averaged DSCs for each cardiac structure. Results: DSC values are highest when LS is 100 and AN is four. The DSC is 0.90 ± 0.02 for pericardium, 0.75 ± 0.06 for atria, and 0.86 ± 0.02 for ventricles. Conclusion: By comparing DSC values, the combination AN = 4 and LS = 100 gives the best performance. This project was supported by NCI grants U24CA12014, U24CA180803, U10CA180868, U10CA180822, PA CURE grant and Bristol‐Myers Squibb and Eli Lilly.

  • Research Article
  • 10.61440/jsdr.2025.v3.40
Enhancing Dental Caries Identification with Deep Learning: A Study of Convolutional Neural Networks and Transfer Learning Approaches
  • Dec 31, 2025
  • Journal of Stomatology & Dental Research
  • Fayqa Mannan

Background: Dental caries is a prevalent oral health issue, and early diagnosis using X-ray images can significantly improve treatment outcomes. Deep learning techniques have been increasingly employed for automated detection of dental caries in radiographic images. Objectives: This study aims to evaluate the effectiveness of deep learning models, including Convolutional Neural Networks (CNNs) and transfer learning approaches, in identifying dental caries using periapical radiographs. Methods: We utilized a traditional CNN model along with transfer learning models, including Visual Geometry Group (VGG16, VGG19), ResNet50, and Inception V3. The CNN model consisted of three sets of 2D convolutional layers followed by activation, max-pooling, flatten, dense layers, dropout, and final activation layers. For the transfer learning models, the top convolutional layers were frozen to prevent retraining, allowing only the last layers to be trained. Hyperparameters were optimized using a grid search approach, and model performance was validated using the Shuffle-Split-Cross (SSC) validation method. Results: Ten images were generated for each original image, resulting in a total of 1,150 training dataset images. The accuracy achieved by the CNN, VGG16, VGG19, ResNet50, and Inception V3 models was 90%, 96%, 73%, 70%, and 73%, respectively. Among these, VGG16 exhibited the highest accuracy. Conclusions: The findings demonstrate that transfer learning, particularly with VGG16, is highly effective in diagnosing dental caries from periapical radiographs. These results highlight the potential of deep learning models for improving automated dental diagnostics. Transfer learning, especially with VGG-16, achieved the highest accuracy (96%) in this study, outperforming both traditional CNN and related studies, highlighting its effectiveness for dental caries detection.
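
The frozen-layer transfer learning described above, in which a pretrained convolutional base is reused and only newly added classification layers are trained, can be sketched with tf.keras as follows. The input size, head layers, class count (caries vs. no caries) and optimizer settings are assumptions for illustration, not the study's actual configuration or hyperparameters.

```python
# Sketch of VGG16 transfer learning with a frozen convolutional base (tf.keras).
import tensorflow as tf

def build_vgg16_classifier(input_shape=(224, 224, 3), num_classes=2):
    base = tf.keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False                       # freeze pretrained conv layers

    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.applications.vgg16.preprocess_input(inputs)
    x = base(x, training=False)                  # keep the base in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_vgg16_classifier()
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=20)  # datasets not shown
```

A common second stage is to unfreeze the deepest convolutional blocks and continue training at a much lower learning rate once the new head has converged.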

  • Research Article
  • Cited by 26
  • 10.1186/s12880-021-00609-0
Combining weakly and strongly supervised learning improves strong supervision in Gleason pattern classification
  • May 8, 2021
  • BMC Medical Imaging
  • Sebastian Otálora + 3 more

Background: One challenge in training deep convolutional neural network (CNN) models with whole slide images (WSIs) is providing the required large number of costly, manually annotated image regions. Strategies to alleviate the scarcity of annotated data include using transfer learning, data augmentation and training the models with less expensive image-level annotations (weakly-supervised learning). However, it is not clear how to combine the use of transfer learning in a CNN model when different data sources are available for training, or how to leverage the combination of large amounts of weakly annotated images with a set of local region annotations. This paper aims to evaluate CNN training strategies based on transfer learning to leverage the combination of weak and strong annotations in heterogeneous data sources. The trade-off between classification performance and annotation effort is explored by evaluating a CNN that learns from strong labels (region annotations) and is later fine-tuned on a dataset with less expensive weak (image-level) labels. Results: As expected, the model performance on strongly annotated data steadily increases as the percentage of strong annotations that are used increases, reaching a performance comparable to pathologists (κ = 0.691 ± 0.02). Nevertheless, the performance sharply decreases when the model is applied to the WSI classification scenario (κ = 0.307 ± 0.133), and it remains lower regardless of the number of annotations used. The model performance increases when fine-tuning the model for the task of Gleason scoring with the weak WSI labels (κ = 0.528 ± 0.05). Conclusion: Combining weak and strong supervision improves strong supervision in classification of Gleason patterns using tissue microarrays (TMA) and WSI regions. Our results provide useful strategies for training CNN models that combine few annotated data and heterogeneous data sources. The performance increases in the controlled TMA scenario with the number of annotations used to train the model. Nevertheless, the performance is hindered when the trained TMA model is applied directly to the more challenging WSI classification problem. This demonstrates that a good pre-trained model for prostate cancer TMA image classification may lead to the best downstream model if fine-tuned on the WSI target dataset. We have made available the source code repository for reproducing the experiments in the paper: https://github.com/ilmaro8/Digital_Pathology_Transfer_Learning
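
The strong-to-weak strategy evaluated in this paper, pretraining on region-annotated patches and then fine-tuning with image-level labels, can be illustrated roughly as below. In this sketch the weak stage simply lets every patch of a slide inherit the slide-level label, which is a simplifying assumption and not necessarily the aggregation scheme used by the authors; the ResNet-18 backbone, class count and dummy tensors are likewise placeholders.

```python
# Sketch: strong (patch-level) pretraining followed by weak (slide-level) fine-tuning.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # e.g. benign plus three Gleason patterns (assumed)

def make_model():
    m = models.resnet18(weights=None)            # backbone is an assumption
    m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
    return m

def train_step(model, optimizer, patches, labels):
    """One supervised step; labels are per-patch (strong stage) or
    inherited from the whole-slide label (weak fine-tuning stage)."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(patches), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

model = make_model()

# Stage 1: strong supervision on region-annotated patches (dummy tensors here).
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
strong_patches = torch.randn(8, 3, 224, 224)
strong_labels = torch.randint(0, NUM_CLASSES, (8,))
train_step(model, opt, strong_patches, strong_labels)

# Stage 2: weak fine-tuning, every patch of one slide inherits the slide label,
# typically at a lower learning rate.
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
weak_patches = torch.randn(8, 3, 224, 224)
slide_label = torch.full((8,), 2, dtype=torch.long)   # whole-slide label
train_step(model, opt, weak_patches, slide_label)
```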

  • Research Article
  • Cited by 16
  • 10.1155/2020/1475164
The Real-Time Mobile Application for Classifying of Endangered Parrot Species Using the CNN Models Based on Transfer Learning
  • Mar 9, 2020
  • Mobile Information Systems
  • Daegyu Choe + 2 more

Among the many deep learning methods, the convolutional neural network (CNN) model has an excellent performance in image recognition. Research on identifying and classifying image datasets using CNN is ongoing. Animal species recognition and classification with CNN is expected to be helpful for various applications. However, sophisticated feature recognition is essential to classify quasi-species with similar features, such as the quasi-species of parrots that have a high color similarity. The purpose of this study is to develop a vision-based mobile application to classify endangered parrot species using an advanced CNN model based on transfer learning (some parrots have quite similar colors and shapes). We acquired the images in two ways: collecting them directly from the Seoul Grand Park Zoo and crawling them using the Google search. Subsequently, we have built advanced CNN models with transfer learning and trained them using the data. Next, we converted one of the fully trained models into a file for execution on mobile devices and created the Android package files. The accuracy was measured for each of the eight CNN models. The overall accuracy for the camera of the mobile device was 94.125%. For certain species, the accuracy of recognition was 100%, with the required time of only 455 ms. Our approach helps to recognize the species in real time using the camera of the mobile device. Applications will be helpful for the prevention of smuggling of endangered species in the customs clearance area.

  • Research Article
  • 10.58578/ajstea.v4i1.8252
Comparison of CNN and CNN-LSTM Performance in Facial Expression Classification Based on FER2013 Dataset
  • Jan 5, 2026
  • Asian Journal of Science, Technology, Engineering, and Art
  • Putu Ananda Adi Savitri + 2 more

Although facial expression recognition (FER) using deep learning has received increasing attention in prior studies, research specifically addressing the comparative effectiveness of sequential modeling on static image data remains limited. This study aims to evaluate and compare the performance of a pure Convolutional Neural Network (CNN) model and a hybrid CNN–Long Short-Term Memory (CNN-LSTM) model in classifying seven basic facial expressions using the static FER2013 dataset. A quantitative experimental approach with a comparative study design was employed, utilizing the publicly available FER2013 dataset and two custom deep learning architectures. Data were obtained from FER2013 and model performance was evaluated using accuracy, precision, recall, F1-score, and AUC-ROC metrics. The findings indicate that the pure CNN model significantly outperformed the CNN-LSTM model, achieving a testing accuracy of 63.25% compared to 46.82% for the hybrid model; the CNN provided strong discrimination for visually distinct classes but continued to struggle with visually similar expressions. These results contribute to the theoretical development of deep learning architecture selection and expand understanding of the application of sequence models to static data. The study concludes that data characteristics (static versus temporal) play a crucial role in determining model effectiveness, and that for static datasets such as FER2013, a pure CNN constitutes the more appropriate choice. The implications of this research include theoretical contributions to the growing literature on deep learning-based FER and practical recommendations for developers to prioritize CNN architectures for non-temporal image classification tasks, while also highlighting opportunities for future research on transfer learning and attention mechanisms to better capture subtle expression nuances.

  • Research Article
  • Cited by 2
  • 10.1016/j.ejrad.2025.112168
Diagnosis of thyroid cartilage invasion by laryngeal and hypopharyngeal cancers based on CT with deep learning.
  • Aug 1, 2025
  • European journal of radiology
  • Yuki Takano + 8 more

