Hybrid CNN-transformer architectures combine the global contextualization of transformers with the local feature acuity of CNNs, advancing medical image segmentation. However, most research has focused on the design and composition of hybrid structures while neglecting the data structure itself, whose exploitation can enhance segmentation performance, optimize resource efficiency, and improve model generalization and interpretability. In this work, we propose a data-oriented octree inverse hierarchical order aggregation hybrid transformer-CNN (nnU-OctTN), which delves deeply into the data itself to identify and harness its potential. nnU-OctTN employs U-Net as its foundational framework, with a node aggregation transformer serving as the encoder. Data features are stored in an octree data structure in which each node is computed autonomously yet remains interconnected through a block-to-block local information exchange mechanism. Oriented towards multi-resolution feature map learning, a cross-fusion module is designed that associates the encoder and decoder in a staggered vertical and horizontal manner. Inspired by nnUNet, our framework automatically adapts network parameters to the dataset instead of using pre-trained weights for initialization. nnU-OctTN was evaluated on the BTCV, ACDC, and BraTS datasets and achieved excellent performance, with Dice similarity coefficient (DSC) scores of 86.95, 92.82, and 90.61, respectively, demonstrating its generalizability and effectiveness. The effectiveness of the cross-fusion module and the scalability of the model are validated through ablation experiments on the BTCV and Kidney datasets. Extensive qualitative and quantitative experimental results demonstrate that nnU-OctTN achieves high-quality 3D medical segmentation with competitive performance against current state-of-the-art methods, offering a promising approach for clinical applications.