Medical image segmentation provides a reliable basis for diagnostic analysis and disease treatment by capturing the global and local features of the target region. To learn global features, existing methods either replace convolutional neural networks (CNNs) with pure transformers or stack transformer layers at the deepest layers of CNNs. Nevertheless, they fall short in exploring local-global cues at each scale and the interactions among consensual regions across multiple scales, which hinders the learning of changes in the size, shape, and position of target objects. To address these defects, we propose a novel Intra and Inter Attention with Mutual Consistency Learning Network (IIAM). Concretely, we design an intra attention module to aggregate CNN-based local features and transformer-based global information at each scale. In addition, to capture the interactions among consensual regions across multiple scales, we devise an inter attention module to explore the cross-scale dependencies between the object and its surroundings. Moreover, to reduce the impact of blurred regions in medical images on the final segmentation, we introduce multiple decoders to estimate model uncertainty: a mutual consistency learning strategy minimizes the discrepancy among their outputs during end-to-end training, and the outputs of the three decoders are weighted to produce the final segmentation result. Extensive experiments on three benchmark datasets verify the efficacy of our method and demonstrate that it outperforms state-of-the-art techniques.
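As a rough illustration of the two fusion ideas the abstract describes, the sketch below pairs a cross-attention block (local CNN tokens attending to transformer tokens at one scale) with a weighted combination of three decoder outputs. This is a minimal sketch, not the authors' implementation: the module names, tensor shapes, and the softmax-based weighting are our assumptions for illustration.

```python
# Minimal sketch (hypothetical names/shapes, not the authors' code) of:
# - intra attention: fuse CNN local features with transformer global
#   features at one scale via cross-attention,
# - weighting the outputs of three decoders into a final segmentation.
import torch
import torch.nn as nn

class IntraAttention(nn.Module):
    """Fuse local (CNN) and global (transformer) features at one scale."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Local tokens attend to global tokens (cross-attention).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feat, global_feat):
        # local_feat, global_feat: (B, N, C) token sequences at one scale
        fused, _ = self.attn(query=local_feat, key=global_feat, value=global_feat)
        return self.norm(local_feat + fused)  # residual fusion

def weighted_segmentation(decoder_logits, weights):
    # decoder_logits: list of (B, num_classes, H, W) tensors from the
    # three decoders; weights: per-decoder scalars (e.g. uncertainty-based;
    # the weighting scheme here is assumed, not taken from the paper).
    stacked = torch.stack(decoder_logits)            # (3, B, C, H, W)
    w = torch.softmax(torch.as_tensor(weights), 0)   # normalize weights
    return (w.view(-1, 1, 1, 1, 1) * stacked).sum(0)

# Usage with toy shapes:
if __name__ == "__main__":
    B, N, C = 2, 64, 32
    intra = IntraAttention(dim=C)
    out = intra(torch.randn(B, N, C), torch.randn(B, N, C))
    logits = [torch.randn(B, 2, 16, 16) for _ in range(3)]
    final = weighted_segmentation(logits, [1.0, 1.0, 1.0])
    print(out.shape, final.shape)  # (2, 64, 32) and (2, 2, 16, 16)
```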