Abstract

Vision Transformer (ViT) has emerged as a promising alternative to convolutional neural networks on large datasets. However, applying ViT directly to medical image segmentation is challenging because of its lack of inductive bias, which means effective training requires a large number of high-quality annotated medical images. Recent studies have found that, in addition to the increased model capacity and generalization afforded by the lack of inductive bias, the strong performance of Transformers can also be attributed to their large receptive field. In this paper, we propose a U-shaped medical image segmentation model that combines large kernel convolutions with Transformers. Specifically, we construct a basic Transformer unit from a pyramidal convolution module with multi-scale kernels and a multi-layer perceptron. In the pyramidal convolution module, we employ grouped convolution to reduce the parameter count and computational cost, and use multi-scale large kernel attention as the basis for more efficient feature extraction. Different kernel sizes are applied to different groups, enhancing the extraction of features over multiple receptive fields. To refine the features produced by the encoder, the U-shaped model integrates a variant of the pyramidal convolution module into the skip connections; this variant applies multi-scale large kernel convolutional attention based on channel splitting, enabling efficient refinement of the feature representations passed through the skip connections. In extensive comparisons on multi-modal medical image datasets, our model outperforms state-of-the-art methods across various evaluation metrics, with particularly notable gains on small-scale medical datasets. Our findings suggest that combining large kernel convolutions with Transformers introduces an advantageous inductive bias, yielding improved performance especially on small-scale medical image datasets. Our code is openly available at https://github.com/medical-images-process/CNN-Transformer.
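
To illustrate the multi-scale grouped large-kernel attention idea described above, the following is a minimal PyTorch sketch of one way such a block could be written. It is not taken from the released repository; the class name, kernel sizes, and channel-splitting scheme are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleLargeKernelAttention(nn.Module):
    """Illustrative multi-scale large-kernel attention block (hypothetical).

    The channel dimension is split into groups; each group is processed by a
    depthwise (grouped) convolution with a different kernel size, so features
    are aggregated over several receptive fields at modest parameter cost.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 7, 11, 21)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        self.split = channels // len(kernel_sizes)
        # One depthwise convolution per channel group, each with its own kernel size.
        self.branches = nn.ModuleList(
            nn.Conv2d(self.split, self.split, k, padding=k // 2, groups=self.split)
            for k in kernel_sizes
        )
        # Pointwise convolution to mix information across the recombined groups.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.split(x, self.split, dim=1)
        attn = torch.cat(
            [branch(c) for branch, c in zip(self.branches, chunks)], dim=1
        )
        attn = self.proj(attn)
        return attn * x  # attention-style gating of the input features


if __name__ == "__main__":
    block = MultiScaleLargeKernelAttention(channels=64)
    y = block(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])
```

Grouping the channels before applying the large kernels keeps the parameter and FLOP budget close to that of a single depthwise convolution, while the mix of small and large kernels provides the multiple receptive fields the abstract refers to.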
