Semantic segmentation of volumetric medical images is essential for accurate delineation of anatomic structures and pathology, enabling quantitative analysis in precision medicine applications. While volumetric segmentation has been extensively studied, most existing methods require full supervision and struggle to generalize to new classes at inference time, particularly for irregular, ill-defined targets such as tumors, where fine-grained, high-salience segmentation is required. Consequently, conventional semantic segmentation methods cannot easily offer zero/few-shot generalization to segment objects of interest beyond their closed training set. Foundation models, such as the Segment Anything Model (SAM), have demonstrated promising zero-shot generalization for interactive instance segmentation based on user prompts. However, these models sacrifice semantic knowledge for generalization and rely largely on collaborative user prompting to inject semantics. For volumetric medical image analysis, a unified approach that combines the semantic understanding of conventional segmentation methods with the flexible, prompt-driven capabilities of SAM is essential for comprehensive anatomical delineation. On the one hand, it is natural to exploit anatomic knowledge to enable semantic segmentation without any user interaction. On the other hand, SAM-like prompting provides the flexibility needed to segment structures beyond the closed training set, enabling quantitative analysis of previously unseen classes. To address these needs in a unified framework, we introduce ProtoSAM-3D, which extends SAM to semantic segmentation of volumetric data via a novel mask-level prototype prediction approach while retaining the flexibility of SAM. Our model employs a spatially aware Transformer to fuse instance-specific intermediate representations from the SAM encoder and decoder into a comprehensive feature embedding for each mask. These embeddings are then classified by computing their similarity with learned prototypes. By predicting prototypes instead of classes directly, ProtoSAM-3D can rapidly adapt to new classes with minimal retraining. Furthermore, we introduce an auto-prompting method that enables semantic segmentation of known classes without user interaction. We demonstrate state-of-the-art zero/few-shot performance on multi-organ segmentation in CT and MRI, where ProtoSAM-3D achieves competitive performance compared to fully supervised methods. Our work represents a step towards interactive semantic segmentation with SAM for volumetric medical image processing.
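As a rough illustration of the prototype-matching step described in the abstract (not the authors' implementation), the minimal sketch below classifies a single mask embedding by cosine similarity to learned class prototypes; the function name, tensor shapes, and temperature value are assumptions for exposition only.

```python
import torch
import torch.nn.functional as F

def classify_mask_embedding(mask_embedding: torch.Tensor,
                            prototypes: torch.Tensor,
                            temperature: float = 0.1):
    """Assign a class to one mask by prototype similarity (illustrative sketch).

    mask_embedding: (D,)   fused embedding of a single predicted mask
    prototypes:     (C, D) learned class prototypes
    Returns (predicted_class_index, class_probabilities).
    """
    # Cosine similarity between the mask embedding and every prototype
    emb = F.normalize(mask_embedding, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    similarities = protos @ emb                      # shape (C,)

    # Softmax over (temperature-scaled) similarities yields a class distribution
    probs = F.softmax(similarities / temperature, dim=-1)
    return probs.argmax().item(), probs

# Hypothetical usage: 16 classes, 256-dimensional embeddings
if __name__ == "__main__":
    proto_bank = torch.randn(16, 256)   # would be learned during training
    mask_emb = torch.randn(256)         # would come from the fused SAM features
    cls_idx, cls_probs = classify_mask_embedding(mask_emb, proto_bank)
    print(cls_idx, cls_probs.max().item())
```

Classifying against a prototype bank rather than through a fixed linear head is what allows new classes to be added by appending (or fine-tuning) prototypes with minimal retraining, as the abstract notes.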