Abstract

Recent advances in CNN- and Transformer-based models have substantially improved medical image segmentation. CNN-based models are less effective than Transformer-based models at capturing global context, while Transformers have structural limitations that restrict their ability to capture fine-grained spatial detail. In this paper, we introduce a Dual-Encoder Fusion Model that incorporates a novel Correlation Fusion Module (CFM) for medical image segmentation tasks. The model leverages the strengths of Convolutional Neural Networks (CNNs) for local context modeling and of Transformers for capturing long-range dependencies among pixels. Experimental results demonstrate a substantial improvement over existing models on the Synapse dataset, with gains of 2.28% and 3.47% in the Dice metric for the aorta and pancreas, respectively. Our model also attains the best (lowest) mean HD95 score of 9.05 on the Synapse dataset while using fewer parameters. On the MSD datasets, it outperforms a fine-tuned nnUNet in three out of five tumor detection tasks and remains competitive in three out of four organ boundary delineation tasks. Notably, on the MSD-Lung dataset, our model surpasses a fine-tuned nnUNet by 6.4% in the Dice metric. These results underscore the effectiveness of the CFM within the dual-encoder architecture.
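The abstract names the dual-encoder design and the CFM but not their internals. As a rough illustration only, the PyTorch sketch below shows one way a CNN branch, a Transformer branch, and a correlation-based fusion gate could be wired together; the class names, layer sizes, and cosine-similarity gating are all assumptions, not the paper's actual CFM.

```python
import torch
import torch.nn as nn

class CorrelationFusion(nn.Module):
    """Hypothetical fusion block: gates each CNN channel by its
    spatial correlation with the Transformer features, then merges
    the two streams with a 1x1 convolution. The paper's actual CFM
    may differ; this only illustrates the dual-stream idea."""
    def __init__(self, channels):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, cnn_feat, trans_feat):
        # cnn_feat, trans_feat: (B, C, H, W)
        b, c, h, w = cnn_feat.shape
        f1 = cnn_feat.flatten(2)                     # (B, C, H*W)
        f2 = trans_feat.flatten(2)                   # (B, C, H*W)
        # Per-channel cosine similarity -> (B, C) correlation scores
        corr = nn.functional.cosine_similarity(f1, f2, dim=2)
        gate = torch.sigmoid(corr).view(b, c, 1, 1)  # channel gates
        fused = torch.cat([cnn_feat * gate, trans_feat], dim=1)
        return self.merge(fused)

class DualEncoderSeg(nn.Module):
    """Minimal dual-encoder segmenter: a small CNN branch for local
    context and a Transformer branch for long-range dependencies."""
    def __init__(self, in_ch=1, channels=64, num_classes=9, patch=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.embed = nn.Conv2d(in_ch, channels, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.fusion = CorrelationFusion(channels)
        self.head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        cnn_feat = self.cnn(x)                       # (B, C, H, W)
        tokens = self.embed(x)                       # (B, C, H/p, W/p)
        b, c, hp, wp = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)      # (B, N, C)
        seq = self.transformer(seq)
        trans_feat = seq.transpose(1, 2).view(b, c, hp, wp)
        # Upsample Transformer features to the CNN resolution
        trans_feat = nn.functional.interpolate(
            trans_feat, size=cnn_feat.shape[2:], mode="bilinear",
            align_corners=False)
        return self.head(self.fusion(cnn_feat, trans_feat))

# Smoke test on a single-channel 256x256 slice (e.g., CT).
model = DualEncoderSeg()
logits = model(torch.randn(1, 1, 256, 256))
print(logits.shape)  # torch.Size([1, 9, 256, 256])
```

The gating step is one plausible way to realize "correlation fusion": channels where the two encoders agree are emphasized before merging, which keeps the sketch cheap (a dot-product per channel) while letting both streams contribute.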