Large foundation models, such as the Segment Anything Model (SAM), have shown remarkable performance in image segmentation tasks. However, how best to realize the utility of these models in domain-specific applications, such as medical image segmentation, remains an open question. Recent work has released MedSAM, a medical version of the foundation model trained on large volumes of medical data, which promises state-of-the-art (SOTA) medical segmentation. Independent community inspection and dissection are needed. Foundation models are developed for general purposes, whereas stable delivery of reliable performance is key to clinical utility. This study aims to elucidate the potential advantages and limitations of deploying foundation models in clinical use by assessing the performance of the off-the-shelf medical foundation model MedSAM for the segmentation of anatomical structures in pelvic MR images. We also explore simple remedies by evaluating the dependence on the prompting scheme. Finally, we demonstrate the need for, and the performance gain from, further specialized fine-tuning. MedSAM and its lightweight version LiteMedSAM were evaluated out-of-the-box on a public MR dataset consisting of 589 pelvic images split 80:20 for training and testing. An nnU-Net model was trained from scratch to serve as a benchmark and to provide bounding box prompts for MedSAM. MedSAM was evaluated using bounding boxes of differing quality: those derived from ground truth labels, those derived from nnU-Net, and both of these with a 5-pixel isometric expansion. Lastly, LiteMedSAM was refined on the training set and reevaluated on this task. Out-of-the-box MedSAM and LiteMedSAM both performed poorly across the structure set, especially for disjoint or non-convex structures. Varying the prompt across the different bounding box inputs had minimal effect. 
For example, the mean Dice score and mean Hausdorff distance (in mm) for the obturator internus using MedSAM and LiteMedSAM were {0.251±0.110, 0.101±0.079} and {34.142±5.196, 33.688±5.306}, respectively. Fine-tuning of LiteMedSAM led to a significant performance gain, improving the Dice score and Hausdorff distance for the obturator internus to 0.864±0.123 and 5.022±10.684 mm, on par with nnU-Net, with no significant difference for most structures. All segmented structures benefited significantly from specialized refinement, by varying margins. While our study points to the potential of deep learning models like MedSAM and LiteMedSAM for medical segmentation, it highlights the need for specialized refinement and adjudication. Off-the-shelf use of such large foundation models is highly likely to be suboptimal, and specialized fine-tuning is often necessary to achieve clinically desired accuracy and stability.
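The bounding box prompts and the Dice score described above can be illustrated with a minimal, dependency-free sketch. It is not the study's code: the list-of-rows mask representation, the function names `mask_to_box` and `dice_score`, the inclusive `(x_min, y_min, x_max, y_max)` box convention, and the clipping of the expanded box to the image border are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]           # 2D binary mask as rows of 0/1 (assumed representation)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel indices


def mask_to_box(mask: Mask, margin: int = 0) -> Optional[Box]:
    """Tight box around the foreground, expanded by `margin` pixels on every
    side (isometric expansion) and clipped to the image border."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for x in range(width) if any(row[x] for row in mask)]
    if not ys or not xs:
        return None  # empty mask: no box prompt can be derived
    return (
        max(min(xs) - margin, 0),
        max(min(ys) - margin, 0),
        min(max(xs) + margin, width - 1),
        min(max(ys) + margin, height - 1),
    )


def dice_score(pred: Mask, truth: Mask) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|); returns 1.0 when both masks are empty
    (a convention chosen here, not taken from the study)."""
    intersection = sum(
        1
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
        if p and t
    )
    total = sum(1 for row in pred for p in row if p)
    total += sum(1 for row in truth for t in row if t)
    return 1.0 if total == 0 else 2.0 * intersection / total
```

Calling `mask_to_box(label, margin=5)` on a ground truth or nnU-Net mask would give the expanded variant of the prompt. Note that a single box drawn around a disjoint structure necessarily also encloses the gap between its parts.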