The accurate and efficient segmentation of the spine is important in the diagnosis and treatment of spine malfunctions and fractures. However, it is still challenging because of large inter-vertebra variations in shape and cross-image localization of the spine. In previous methods, convolutional neural networks (CNNs) have been widely applied as a vision backbone to tackle this task. However, these methods are challenged in utilizing the global contextual information across the whole image for accurate spine segmentation because of the inherent locality of the convolution operation. Compared with CNNs, the Vision Transformer (ViT) has been proposed as another vision backbone with a high capacity to capture global contextual information. However, when the ViT is employed for spine segmentation, it treats all input tokens equally, including vertebrae-related tokens and non-vertebrae-related tokens. Additionally, it lacks the capability to locate regions of interest, thus lowering the accuracy of spine segmentation. To address this limitation, we propose a novel Vertebrae-aware Vision Transformer (VerFormer) for automatic spine segmentation from CT images. Our VerFormer is designed by incorporating a novel Vertebrae-aware Global (VG) block into the ViT backbone. In the VG block, the vertebrae-related global contextual information is extracted by a Vertebrae-aware Global Query (VGQ) module. Then, this information is incorporated into query tokens to highlight vertebrae-related tokens in the multi-head self-attention module. Thus, this VG block can leverage global contextual information to effectively and efficiently locate spines across the whole input, thus improving the segmentation accuracy of VerFormer. Driven by this design, the VerFormer demonstrates a solid capacity to capture more discriminative dependencies and vertebrae-related context in automatic spine segmentation. The experimental results on two spine CT segmentation tasks demonstrate the effectiveness of our VG block and the superiority of our VerFormer in spine segmentation. Compared with other popular CNN- or ViT-based segmentation models, our VerFormer shows superior segmentation accuracy and generalization.
Read full abstract