Abstract

Convolutional neural networks (CNNs) with an encoder–decoder structure are popular in medical image segmentation because of their excellent local feature extraction, but they are limited in capturing global features. The transformer extracts global information well, but adapting it to small medical datasets is challenging and its computational complexity can be heavy. In this work, a serial and parallel network combining CNN and transformer is proposed for accurate 3D medical image segmentation, promoting feature interactions across semantic levels. The core components of the proposed method are the cross-window self-attention based transformer (CWST) module and the multi-scale local enhanced (MLE) module. The CWST module enhances global context understanding by partitioning 3D images into non-overlapping windows and computing sparse global attention between windows. The MLE module selectively fuses features by computing voxel attention between the features of different branches, and uses convolution to strengthen dense local information. Experiments on prostate, atrium, and pancreas MR/CT image datasets consistently demonstrate the advantage of the proposed method over six popular segmentation models, both in qualitative evaluation and in quantitative indexes such as the Dice similarity coefficient, Intersection over Union, 95% Hausdorff distance, and average symmetric surface distance.
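To make the CWST idea concrete, the sketch below illustrates the two operations the abstract names: partitioning a 3D feature volume into non-overlapping windows and computing attention between windows rather than between all voxels. This is a conceptual NumPy sketch under assumed shapes, not the authors' implementation; the function names, the mean-pooled window descriptor, and the cubic window size are all illustrative assumptions.

```python
import numpy as np

def partition_windows(vol, w):
    # vol: (D, H, W, C) feature volume; w: cubic window size.
    # Assumes D, H, W are divisible by w (illustrative simplification).
    D, H, W, C = vol.shape
    x = vol.reshape(D // w, w, H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # (nD, nH, nW, w, w, w, C)
    return x.reshape(-1, w ** 3, C)           # (num_windows, voxels_per_window, C)

def cross_window_attention(windows):
    # One descriptor token per window (mean over its voxels), then softmax
    # attention across those tokens: cost scales with the number of windows,
    # not the number of voxels, which is what makes the attention sparse/global.
    tokens = windows.mean(axis=1)             # (num_windows, C)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ tokens                      # (num_windows, C)

vol = np.random.rand(8, 8, 8, 4)              # toy 8x8x8 volume, 4 channels
wins = partition_windows(vol, 4)              # 2x2x2 = 8 windows of 64 voxels
out = cross_window_attention(wins)
print(wins.shape, out.shape)                  # (8, 64, 4) (8, 4)
```

With an 8×8×8 volume and window size 4, attention is computed over only 8 window tokens instead of 512 voxels, which is the complexity reduction the window partitioning is meant to buy.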