Background: Self-supervised learning (SSL) is an approach for extracting useful feature representations from unlabeled data, enabling fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated downstream task dataset for both pretraining and fine-tuning. The availability of large, diverse, and uncurated public medical image sets presents an opportunity to create foundation models by applying SSL "in the wild" that are robust to imaging variations. However, the benefit of wild- versus self-pretraining has not been studied for medical image analysis.

Purpose: To compare the robustness of wild- versus self-pretrained models built with convolutional neural network (CNN) and transformer (vision transformer [ViT] and hierarchical shifted window [Swin]) architectures for non-small cell lung cancer (NSCLC) segmentation from 3D computed tomography (CT) scans.

Methods: CNN, ViT, and Swin models were wild-pretrained using 10,412 unlabeled 3D CTs sourced from The Cancer Imaging Archive and internal datasets. Self-pretraining was applied to the same networks using a curated public downstream task dataset (n = 377) of patients with NSCLC. The pretext tasks introduced in the self-distilled masked image transformer were used for both pretraining approaches. All models were fine-tuned to segment NSCLC (n = 377 training dataset) and tested on two separate datasets containing early-stage (public, n = 156) and advanced-stage (internal, n = 196) NSCLC. Models were evaluated in terms of (a) accuracy, (b) robustness to image differences arising from contrast, slice thickness, and reconstruction kernels, and (c) the impact of the pretext tasks used for pretraining. Feature reuse was evaluated using centered kernel alignment (CKA).

Results: Wild-pretrained Swin models showed higher feature reuse in the earlier layers and greater feature differentiation closer to the output. Wild-pretrained Swin outperformed self-pretrained models across the analyzed imaging acquisitions. Neither ViT nor CNN showed a clear benefit of wild-pretraining over self-pretraining. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task, which models global image information.

Conclusions: Wild-pretrained Swin networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained methods. ViT and CNN models did not show a clear benefit of wild-pretraining over self-pretraining.
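For readers unfamiliar with the feature-reuse metric named in the methods, the following is a minimal sketch of linear centered kernel alignment as defined by Kornblith et al. (2019). The function name, shapes, and NumPy setup are illustrative assumptions, not the authors' code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two feature matrices.

    X: (n_samples, d1) activations from one model/layer.
    Y: (n_samples, d2) activations from another model/layer.
    Returns a similarity in [0, 1]; 1 means the representations are
    identical up to an orthogonal transform and isotropic scaling.
    """
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    hsic_xy = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, ord="fro")
    hsic_yy = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic_xy / (hsic_xx * hsic_yy)

# Hypothetical usage: compare stand-in activations, e.g., a layer's
# features before versus after fine-tuning to quantify feature reuse.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 256))
Y = rng.standard_normal((100, 256))
print(linear_cka(X, Y))
```

High CKA between pretrained and fine-tuned activations at a layer indicates that pretrained features are being reused there, which is how the abstract's claim about early-layer reuse in wild-pretrained Swin models is quantified.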
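The masked image prediction pretext task highlighted in the results can be sketched as a masked-patch reconstruction loss. This is a generic MAE-style illustration in PyTorch under assumed tensor shapes, not the self-distilled masked image transformer pipeline used in the study; `masked_patch_loss` and the stand-in `model` are hypothetical.

```python
import torch

def masked_patch_loss(model, patches, mask_ratio=0.7):
    """Minimal masked image prediction loss (MAE-style sketch).

    patches: (batch, n_patches, patch_dim) flattened image patches.
    `model` is assumed to map masked patch sequences back to patch
    space; an illustration only, not the study's implementation.
    """
    b, n, d = patches.shape
    n_mask = int(n * mask_ratio)
    # Random per-sample mask: True where a patch is hidden
    noise = torch.rand(b, n, device=patches.device)
    idx = noise.topk(n_mask, dim=1, largest=False).indices
    mask = torch.zeros(b, n, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, idx, True)
    # Zero out masked patches (real pipelines use learned mask tokens)
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = model(corrupted)  # (b, n, d) reconstruction
    # MSE only over masked positions: the network must infer local
    # structure from the visible context, as the abstract describes
    loss = ((pred - patches) ** 2).mean(dim=-1)  # (b, n)
    return (loss * mask).sum() / mask.sum()

# Hypothetical usage with a trivial stand-in "model"
model = torch.nn.Linear(512, 512)
patches = torch.randn(2, 64, 512)
print(masked_patch_loss(model, patches).item())
```

Restricting the loss to masked positions is what makes this a local-structure task, in contrast to contrastive objectives that compare whole-image embeddings.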