Accurate segmentation of 3D medical images is vital for computer-aided diagnosis. However, the complexity of target morphological variations and the scarcity of labeled data make segmentation challenging. Furthermore, existing models struggle to fully and efficiently integrate global and local information, which hinders structured knowledge acquisition. To overcome these challenges, we introduce the TNT Masking Network (TNT-MNet), a novel transformer-based 3D model built on a transformer-in-transformer (TNT) encoder. For the first time, we bring masked image modeling (MIM) into supervised learning, using target boundary regions as masked prediction targets to enhance structured knowledge acquisition. We apply multiscale random masking to the inner and outer tokens in the online branch to address the segmentation of organs and lesion regions with varying structures at multiple scales and to strengthen modeling capability. The target branch, in contrast, processes all tokens and guides the online branch to reconstruct the masked tokens. Our experiments show that TNT-MNet performs comparably to, or better than, state-of-the-art models on three medical image datasets (BTCV, LiTS2017, and BraTS2020) while effectively reducing the dependence on labeled data. The code and models are publicly available at https://github.com/changliu-work/TNT_MNet.
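To make the masking scheme concrete, the sketch below illustrates one plausible reading of multiscale random masking over outer (patch-level) and inner (sub-patch-level) tokens, with an online branch that sees masked tokens and a target branch that sees all tokens. This is a minimal illustration assuming a PyTorch setting; the function and variable names (`mask_tokens`, the mask ratios, the token shapes) are hypothetical and not taken from the authors' released code.

```python
# Minimal sketch (not the authors' implementation) of multiscale random
# masking on outer and inner token sequences, as described in the abstract.
import torch

def mask_tokens(tokens: torch.Tensor, mask_ratio: float, mask_value: float = 0.0):
    """Randomly replace a fraction of tokens with a mask value.

    tokens: (batch, num_tokens, dim)
    Returns the masked tokens and a boolean mask (True = masked position).
    """
    b, n, _ = tokens.shape
    num_masked = int(n * mask_ratio)
    # Per-sample random permutation; the first `num_masked` indices are masked.
    rand_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)
    mask = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, rand_idx[:, :num_masked], True)
    masked = tokens.masked_fill(mask.unsqueeze(-1), mask_value)
    return masked, mask

# Hypothetical shapes: coarse outer tokens and finer inner tokens, masked at
# different ratios to mimic masking at multiple scales.
outer = torch.randn(2, 64, 384)     # (batch, outer tokens, outer dim)
inner = torch.randn(2, 64 * 8, 96)  # (batch, inner tokens, inner dim)

outer_masked, outer_mask = mask_tokens(outer, mask_ratio=0.5)
inner_masked, inner_mask = mask_tokens(inner, mask_ratio=0.25)

# The online branch would encode (outer_masked, inner_masked); the target
# branch encodes the unmasked (outer, inner) and supplies reconstruction
# targets for the masked positions, e.g. a loss restricted to mask == True.
```

Under this reading, the reconstruction objective is applied only at masked positions, so the target branch acts purely as a teacher providing token-level targets rather than receiving gradients from the masking loss.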