Auto-segmentation of organs-at-risk (OARs) in the head and neck (HN) on computed tomography (CT) images is a time-consuming component of the radiation therapy pipeline that suffers from inter-observer variability. Deep learning (DL) has shown state-of-the-art results in CT auto-segmentation, with larger and more diverse datasets showing better segmentation performance. Institutional CT auto-segmentation datasets have been small historically (n<50) due to the time required for manual curation of images and anatomical labels. Recently, large public CT auto-segmentation datasets (n>1000 aggregated) have become available through online repositories such as The Cancer Imaging Archive. Transfer learning is a technique applied when training samples are scarce, but a large dataset from a closely related domain is available. The purpose of this study was to investigate whether a large public dataset could be used in place of an institutional dataset (n>500), or to augment performance via transfer learning, when building HN OAR auto-segmentation models for institutional use. Auto-segmentation models were trained on a large public dataset (public models) and a smaller institutional dataset (institutional models). The public models were fine-tuned on the institutional dataset using transfer learning (transfer models). We assessed both public model generalizability and transfer model performance by comparison with institutional models. Additionally, the effect of institutional dataset size on both transfer and institutional models was investigated. All DL models used a high-resolution, two-stage architecture based on the popular 3D U-Net. Model performance was evaluated using five geometric measures: the dice similarity coefficient (DSC), surface DSC, 95th percentile Hausdorff distance, mean surface distance (MSD), and added path length. For a small subset of OARs (left/right optic nerve, spinal cord, left submandibular), the public models performed significantly better (p<0.05) than, or showed no significant difference to, the institutional models under most of the metrics examined. For the remaining OARs, the public models were inferior to the institutional models, although performance differences were small (DSC≤0.03, MSD<0.5mm) for seven OARs (brainstem, left/right lens, left/right parotid, mandible, right submandibular). The transfer models performed significantly better than the institutional models for seven OARs (brainstem, right lens, left/right optic nerve, left/right parotid, spinal cord) with a small margin of improvement (DSC≤0.02, MSD<0.4mm). When numbers of institutional training samples were limited, public and transfer models outperformed the institutional models for most OARs (brainstem, left/right lens, left/right optic nerve, left/right parotid, spinal cord, and left/right submandibular). Training auto-segmentation models with public data alone was suitable for a small number of OARs. Using only public data incurred a small performance deficit for most other OARs, when compared with institutional data alone, but may be preferable over time-consuming curation of a large institutional dataset. When a large institutional dataset was available, transfer learning with models pretrained on a large public dataset provided a modest performance improvement for several OARs. When numbers of institutional samples were limited, using the public dataset alone, or as a pretrained model, was beneficial for most OARs.
Read full abstract