Automatic segmentation techniques based on Convolutional Neural Networks (CNNs) are widely adopted to automatically identify any structure of interest from a medical image, as they are not time consuming and not subject to high intra- and inter-operator variability. However, the adoption of these approaches in clinical practice is slowed down by some factors, such as the difficulty in providing an accurate quantification of their uncertainty. This work aims to evaluate the uncertainty quantification provided by two Bayesian and two non-Bayesian approaches for a multi-class segmentation problem, and to compare the risk propensity among these approaches, considering CT images of patients affected by renal cancer (RC). Four uncertainty quantification approaches were implemented in this work, based on a benchmark CNN currently employed in medical image segmentation: two Bayesian CNNs with different regularizations (Dropout and DropConnect), named BDR and BDC, an ensemble method (Ens) and a test-time augmentation (TTA) method. They were compared in terms of segmentation accuracy, using the Dice score, uncertainty quantification, using the ratio of correct-certain pixels (RCC) and incorrect-uncertain pixels (RIU), and with respect to inter-observer variability in manual segmentation. They were trained with the Kidney and Kidney Tumor Segmentation Challenge launched in 2021 (Kits21), for which multi-class segmentations of kidney, RC, and cyst on 300 CT volumes are available. Moreover, they were tested considering this and other two public renal CT datasets. Accuracy results achieved large differences across the structures of interest for all approaches, with an average Dice score of 0.92, 0.58, and 0.21 for kidney, tumor, and cyst, respectively. In terms of uncertainties, TTA provided the highest uncertainty, followed by Ens and BDC, whereas BDR provided the lowest, and minimized the number of incorrect certain pixels worse than the other approaches. Again, large differences were seen across the three structures in terms of RCC and RIU. These metrics were associated with different risk propensity, as BDR was the most risk-taking approach, able to provide higher accuracy in its prediction, but failing to assign uncertainty on incorrect segmentation in every case. The other three approaches were more conservative, providing large uncertainty regions, with the drawback of giving alert also on correct areas. Finally, the analysis of the inter-observer segmentation variability showed a significant variation among the four approaches on the external dataset, with BDR reporting the lowest agreement (Dice = 0.82), and TTA obtaining the highest score (Dice = 0.94). Our outcomes highlight the importance of quantifying the segmentation uncertainty and that decision-makers can choose the approach most in line with the risk propensity degree required by the application and theirpolicy.
Read full abstract