Recently, Vision Transformers (ViTs) have emerged as a breakthrough in computer vision and image analysis. However, their exceptional performance depends on large amounts of annotated training data and considerable computational resources, which raises the risk of overfitting and poor generalization in settings where training data are scarce. To cope with this issue, we propose the Level-Set Transformer (TransLevelSet), a hybrid methodology that augments the loss function used in ViT training with an additional term originally defined in the context of Level-Set (LS) deformable models. This loss term exploits the spatial information captured by level-set energy terms for image segmentation and mitigates the dependency of ViTs on the amount of available data. Moreover, the level-set loss promotes smooth and topologically consistent delineation of structures, while taking advantage of the capacity of ViTs to capture complex spatial relationships and contextual information. The main contributions of this work include: a) a pioneering approach to the integration of ViTs with level-sets; and b) the application of the proposed methodology to two cancer-related case studies, namely malignant melanoma and colon cancer. We evaluate TransLevelSet on three publicly available benchmark datasets for medical image segmentation, comprising dermoscopic and histopathological images. The experimental results demonstrate consistent gains in generalization capability introduced by the LS terms.
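The abstract does not give the exact formulation of the level-set term. As a minimal sketch, assuming a Chan-Vese style energy (region fit plus contour-length regularization) evaluated on the soft foreground probabilities of a ViT segmentation head, the combined objective could look as follows; the function names, tensor shapes, and weighting hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def level_set_loss(probs, images, lambda_region=1.0, mu_length=1e-4):
    """Chan-Vese style region + length energy on soft masks (illustrative).

    probs:  (B, 1, H, W) foreground probabilities from the segmentation head.
    images: (B, 1, H, W) grayscale intensities in [0, 1].
    """
    eps = 1e-6
    fg = probs
    bg = 1.0 - probs

    # Mean intensity inside / outside the soft contour, per image.
    c_in = (fg * images).sum(dim=(2, 3), keepdim=True) / (fg.sum(dim=(2, 3), keepdim=True) + eps)
    c_out = (bg * images).sum(dim=(2, 3), keepdim=True) / (bg.sum(dim=(2, 3), keepdim=True) + eps)

    # Region term: penalize intensity variance on each side of the contour.
    region = (fg * (images - c_in) ** 2 + bg * (images - c_out) ** 2).mean()

    # Length term: total variation of the soft mask encourages smooth boundaries.
    dh = torch.abs(probs[:, :, 1:, :] - probs[:, :, :-1, :]).mean()
    dw = torch.abs(probs[:, :, :, 1:] - probs[:, :, :, :-1]).mean()
    length = dh + dw

    return lambda_region * region + mu_length * length

def total_loss(logits, targets, images, alpha=0.1):
    """Supervised cross-entropy plus the level-set regularizer (sketch)."""
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    return bce + alpha * level_set_loss(probs, images)
```

In this reading, the level-set term acts as an unsupervised, image-driven regularizer added to the standard supervised loss, which is one plausible way the spatial energy could reduce reliance on large annotated datasets.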