SS-CXR: Self-Supervised Pretraining Using Chest X-Rays Towards A Domain Specific Foundation Model

Abstract

Chest X-rays (CXRs) are a widely used imaging modality for the diagnosis and prognosis of lung disease, and there is a large body of work in which machine learning algorithms are developed for specific tasks. However, traditional diagnostic tool design based on supervised learning is burdened by the need for annotated training data, which must be of high quality to yield good clinical outcomes. Here, we propose an alternative: a new self-supervised paradigm in which a general representation is learned from CXRs using a group-masked self-supervised framework. The pre-trained model is then fine-tuned for domain-specific tasks such as COVID-19 and pneumonia detection and general health screening. We show that the same pre-training can be used for the lung segmentation task. Our proposed paradigm shows robust performance across multiple downstream tasks, demonstrating the success of the pre-training. Moreover, the performance of the pre-trained models on data with significant drift at test time indicates that a more generic representation has been learned. The methods are further validated by COVID-19 detection on a unique small-scale pediatric dataset, where the performance gain (~25%) over a supervised transformer-based method is substantial. This adds credence to the strength and reliability of our proposed framework and pre-training strategy.
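
To make the group-masked pretraining idea concrete, here is a minimal sketch that masks one contiguous block of image patches and reconstructs only the masked patches with a tiny encoder-decoder. The paper's exact masking scheme, backbone, and sizes are not reproduced here; the block mask, patch size, and layer widths below are illustrative assumptions in the MAE style.

```python
# Minimal sketch of masked-patch self-supervised pretraining on CXRs.
# The block mask stands in for the paper's "group" masking; all sizes
# are illustrative, not the authors' configuration.
import torch
import torch.nn as nn

def patchify(x, p=16):
    # (B, C, H, W) -> (B, N, C*p*p) non-overlapping patches
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
    return x.contiguous().view(B, -1, C * p * p)

def group_mask(n_patches, group=16):
    # mask one contiguous run of `group` patch indices (toy "group" mask)
    start = torch.randint(0, n_patches - group + 1, (1,)).item()
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[start:start + group] = True
    return mask

encoder = nn.Sequential(nn.Linear(256, 128), nn.GELU(), nn.Linear(128, 128))
decoder = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 256))
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

x = torch.randn(8, 1, 224, 224)                    # stand-in CXR batch
patches = patchify(x)                              # (8, 196, 256)
mask = group_mask(patches.shape[1])
inp = patches.clone()
inp[:, mask] = 0.0                                 # hide the masked group
recon = decoder(encoder(inp))
loss = ((recon[:, mask] - patches[:, mask]) ** 2).mean()  # loss on masked patches only
opt.zero_grad(); loss.backward(); opt.step()
```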

Similar Papers
  • Research Article
  • Cited by 22
  • 10.1186/s41747-023-00411-3
Enhancing diagnostic deep learning via self-supervised pretraining on large-scale, unlabeled non-medical images
  • Feb 8, 2024
  • European Radiology Experimental
  • Soroosh Tayebi Arasteh + 4 more

Background: Pretraining on labeled datasets, like ImageNet, has become a technical standard in advanced medical image analysis. However, the emergence of self-supervised learning (SSL), which leverages unlabeled data to learn robust features, presents an opportunity to bypass the intensive labeling process. In this study, we explored whether SSL pretraining on non-medical images can be applied to chest radiographs and how it compares to supervised pretraining on non-medical images and on medical images. Methods: We utilized a vision transformer and initialized its weights based on the following: (i) SSL pretraining on non-medical images (DINOv2), (ii) supervised learning (SL) pretraining on non-medical images (ImageNet dataset), and (iii) SL pretraining on chest radiographs from the MIMIC-CXR database, the largest labeled public dataset of chest radiographs to date. We tested our approach on over 800,000 chest radiographs from 6 large global datasets, diagnosing more than 20 different imaging findings. Performance was quantified using the area under the receiver operating characteristic curve and evaluated for statistical significance using bootstrapping. Results: SSL pretraining on non-medical images not only outperformed ImageNet-based pretraining (p < 0.001 for all datasets) but, in certain cases, also exceeded SL on the MIMIC-CXR dataset. Our findings suggest that selecting the right pretraining strategy, especially with SSL, can be pivotal for improving diagnostic accuracy of artificial intelligence in medical imaging. Conclusions: By demonstrating the promise of SSL in chest radiograph analysis, we underline a transformative shift towards more efficient and accurate AI models in medical imaging. Relevance statement: Self-supervised learning highlights a paradigm shift towards the enhancement of AI-driven accuracy and efficiency in medical imaging. Given its promise, the broader application of self-supervised learning in medical imaging calls for deeper exploration, particularly in contexts where comprehensive annotated datasets are limited.
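
As a rough illustration of setup (i), the sketch below initializes a ViT from Meta's public DINOv2 release and attaches a multi-label head for radiographic findings. The hub entrypoint is DINOv2's documented one, but the label count, input batch, and frozen-backbone (linear-probing) choice are placeholders rather than the study's protocol, and the hub call needs network access.

```python
# Sketch: DINOv2 self-supervised ViT weights + a linear multi-label head.
# Requires network access for torch.hub; sizes are placeholders.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
head = nn.Linear(backbone.embed_dim, 20)           # ~20 findings, multi-label

x = torch.randn(4, 3, 224, 224)                    # CXRs replicated to 3 channels
with torch.no_grad():                              # linear probing; unfreeze to fine-tune
    feats = backbone(x)                            # (4, embed_dim) CLS features
logits = head(feats)
targets = torch.randint(0, 2, (4, 20)).float()
loss = nn.BCEWithLogitsLoss()(logits, targets)
```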

  • Conference Article
  • Cited by 97
  • 10.1109/wacv51458.2022.00112
Self-Supervised Pretraining Improves Self-Supervised Pretraining
  • Jan 1, 2022
  • Colorado J Reed + 11 more

While self-supervised pretraining has proven beneficial for many computer vision tasks, it requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation. Prior work demonstrates that models pretrained on datasets dissimilar to their target data, such as chest X-ray models trained on ImageNet, underperform models trained from scratch. Users that lack the resources to pretrain must use existing models with lower performance. This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model. Through experimentation on 16 diverse vision datasets, we show HPT converges up to 80× faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data. Taken together, HPT provides a simple framework for obtaining better pretrained representations with less computational resources.
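
HPT is agnostic to the SSL objective; its core move is starting self-supervised pretraining from an existing checkpoint rather than random weights. The sketch below uses ImageNet weights as the "base" model and a generic SimSiam-style loss as a stand-in objective; the backbone, projector, and predictor sizes are assumptions, and the weight download requires network access.

```python
# Sketch of the HPT idea: initialize SSL pretraining from a pretrained
# checkpoint. The SimSiam-style loss is only a stand-in objective.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)  # "base" model
backbone.fc = nn.Identity()                        # keep features only

projector = nn.Linear(2048, 256)
predictor = nn.Linear(256, 256)

def neg_cosine(p, z):
    # negative cosine similarity with stop-gradient on the target branch
    return -nn.functional.cosine_similarity(p, z.detach(), dim=-1).mean()

v1, v2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)  # two augmented views
z1, z2 = projector(backbone(v1)), projector(backbone(v2))
loss = neg_cosine(predictor(z1), z2) / 2 + neg_cosine(predictor(z2), z1) / 2
```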

  • Research Article
  • Cited by 4
  • 10.1038/s41598-024-74043-x
Multimodal masked siamese network improves chest X-ray representation learning
  • Sep 28, 2024
  • Scientific Reports
  • Saeed Shurrab + 2 more

Self-supervised learning methods for medical images primarily rely on the imaging modality during pretraining. Although such approaches deliver promising results, they do not take advantage of the associated patient or scan information collected within Electronic Health Records (EHR). This study aims to develop a multimodal pretraining approach for chest radiographs that incorporates EHR data as an additional modality during training. We propose to incorporate EHR data during self-supervised pretraining with a Masked Siamese Network (MSN) to enhance the quality of chest radiograph representations. We investigate three types of EHR data, including demographic, scan metadata, and inpatient stay information. We evaluate the multimodal MSN on three publicly available chest X-ray datasets, MIMIC-CXR, CheXpert, and NIH-14, using two vision transformer (ViT) backbones, specifically ViT-Tiny and ViT-Small. In assessing the quality of the representations through linear evaluation, our proposed method demonstrates significant improvement compared to vanilla MSN and state-of-the-art self-supervised learning baselines. In particular, our proposed method achieves an improvement of 2% in the Area Under the Receiver Operating Characteristic Curve (AUROC) compared to vanilla MSN and 5% to 8% compared to other baselines, including uni-modal ones. Furthermore, our findings reveal that demographic features provide the most significant performance improvement. Our work highlights the potential of EHR-enhanced self-supervised pretraining for medical imaging and opens opportunities for future research to address limitations in existing representation learning methods for other medical imaging modalities, such as neuro-, ophthalmic, and sonar imaging.
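
Setting the MSN machinery (masked views, prototypes) aside, the multimodal ingredient can be sketched as encoding a few tabular EHR fields and fusing them with the image representation. The two fields, the stand-in encoders, and the additive fusion below are invented for illustration, not the authors' design.

```python
# Sketch: fuse an image embedding with an embedding of tabular EHR
# fields during pretraining. Field names and fusion rule are invented.
import torch
import torch.nn as nn

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 256))  # stand-in ViT
ehr_encoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 256))

x = torch.randn(8, 1, 224, 224)                    # radiographs
age = torch.rand(8, 1) * 100                       # demographic field (years)
sex = torch.randint(0, 2, (8, 1)).float()          # demographic field (coded)
fused = image_encoder(x) + ehr_encoder(torch.cat([age, sex], dim=1))
```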

  • Research Article
  • Cited by 1
  • 10.1109/jbhi.2024.3505303
Learning Consistent Semantic Representation for Chest X-ray via Anatomical Localization in Self-Supervised Pre-Training.
  • Mar 1, 2025
  • IEEE journal of biomedical and health informatics
  • Surong Chu + 7 more

Despite the similar global structures in Chest X-ray (CXR) images, the same anatomy exhibits varying appearances across images, including differences in local textures, shapes, colors, etc. Learning consistent representations for anatomical semantics through these diverse appearances poses a great challenge for self-supervised pre-training in CXR images. To address this challenge, we propose two new pre-training tasks: inner-image anatomy localization (IIAL) and cross-image anatomy localization (CIAL). Leveraging the relatively stable positions of identical anatomy across images, we utilize position information directly as supervision to learn consistent semantic representations. Specifically, IIAL adopts a coarse-to-fine heatmap localization approach to correlate anatomical semantics with positions, while CIAL leverages feature affine alignment and heatmap localization to establish a correspondence between identical anatomical semantics across varying images, despite their appearance diversity. Furthermore, we introduce a unified end-to-end pre-training framework, anatomy-aware representation learning (AARL), integrating IIAL, CIAL, and a pixel restoration task. The advantages of AARL are: 1) preserving the appearance diversity and 2) training in a simple end-to-end way avoiding complicated preprocessing. Extensive experiments on six downstream tasks, including classification and segmentation tasks in various application scenarios, demonstrate that our AARL: 1) has more powerful representation and transferring ability; 2) is annotation-efficient, reducing the demand for labeled data and 3) improves the sensitivity to detecting various pathological and anatomical patterns.
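
The localization tasks rest on using a known anatomical position as the supervision signal. As a generic illustration (not AARL's coarse-to-fine or cross-image alignment stages), the sketch below regresses a Gaussian heatmap centred on a landmark coordinate; network, image size, and landmark are placeholders.

```python
# Sketch of position-as-supervision: regress a heatmap peaked at a
# known landmark coordinate. Network and sizes are placeholders.
import torch
import torch.nn as nn

def gaussian_heatmap(h, w, cy, cx, sigma=4.0):
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    return torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

net = nn.Conv2d(1, 1, kernel_size=3, padding=1)      # stand-in for a real decoder
x = torch.randn(1, 1, 64, 64)
target = gaussian_heatmap(64, 64, cy=30.0, cx=22.0)  # landmark position as the label
loss = nn.functional.mse_loss(net(x)[0, 0], target)
```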

  • Research Article
  • Cited by 7
  • 10.1007/978-3-031-44992-5_8
Improving Medical Image Classification in Noisy Labels Using only Self-supervised Pretraining.
  • Jan 1, 2023
  • Data Engineering in Medical Imaging: First MICCAI Workshop, DEMI 2023, Held in Conjunction with MICCAI 2023, Vancouver, BC, Canada, October 8, 2023, Proceedings
  • Bidur Khanal + 3 more

Noisy labels hurt deep learning-based supervised image classification performance, as the models may overfit the noise and learn corrupted feature extractors. For natural image classification training with noisy labeled data, model initialization with contrastive self-supervised pretrained weights has been shown to reduce feature corruption and improve classification performance. However, no works have explored: i) how other self-supervised approaches, such as pretext task-based pretraining, impact learning with noisy labels, and ii) any self-supervised pretraining method alone for medical images in noisy label settings. Medical images often feature smaller datasets and subtle inter-class variations, requiring human expertise to ensure correct classification. Thus, it is not clear whether methods that improve learning with noisy labels on natural image datasets such as CIFAR would also help with medical images. In this work, we explore contrastive and pretext task-based self-supervised pretraining to initialize the weights of a deep learning classification model for two medical datasets with self-induced noisy labels: NCT-CRC-HE-100K tissue histological images and COVID-QU-Ex chest X-ray images. Our results show that models initialized with pretrained weights obtained from self-supervised learning can effectively learn better features and improve robustness against noisy labels.
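
Operationally, the recipe reduces to an initialization choice before supervised training on noisy labels. In the sketch below, a second network stands in for an SSL-pretrained encoder whose backbone weights are copied over (leaving the head random), and 20% symmetric label noise imitates the self-induced noise studied here; the noise rate and model are illustrative.

```python
# Sketch: initialize a classifier's backbone from (stand-in) SSL weights,
# leaving the head random, then train on labels with injected noise.
import torch
import torchvision.models as models

ssl_encoder = models.resnet18()                    # stand-in for an SSL-pretrained encoder
model = models.resnet18(num_classes=4)
backbone_state = {k: v for k, v in ssl_encoder.state_dict().items()
                  if not k.startswith("fc.")}      # copy everything except the head
model.load_state_dict(backbone_state, strict=False)

labels = torch.randint(0, 4, (100,))
flip = torch.rand(100) < 0.2                       # 20% symmetric label noise
noisy_labels = torch.where(flip, torch.randint(0, 4, (100,)), labels)
```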

  • Research Article
  • 10.5194/isprs-annals-x-2-w2-2025-31-2025
Ending Overfitting for UAV Applications - Self-Supervised Pretraining on Multispectral UAV Data
  • Oct 29, 2025
  • ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
  • Jurrian Doornbos + 1 more

Abstract. While UAVs have revolutionized data collection for remote sensing, the practical application of Deep Learning remains severely limited by the scarcity of labelled training data, creating a stark contrast between laboratory successes and field performance. This research investigates whether transfer learning techniques can overcome this "small data problem" by enabling UAV-based deep learning models to generalize effectively across diverse environments without requiring prohibitive amounts of labelled examples. We present the use of an efficient self-supervised learning framework (FastSiam) tailored specifically for multispectral UAV imagery to overcome this generalization gap. Our approach enables effective feature learning without requiring extensive labelled data, bridging the gap between the potential of foundation models and the resource constraints of UAV remote sensing applications. We evaluate our method on a vineyard segmentation task across multiple geographic locations, demonstrating that models with FastSiam pretrained backbones significantly outperform their end-to-end trained counterparts, even with extremely limited labelled data. The most sophisticated architecture tested, Swin-T with a pretrained backbone, achieved an average F1 score of 0.80 across diverse test sites, showcasing robust generalization capabilities. Importantly, our results show that pretrained models benefit more from diversity in training samples than from sheer volume, suggesting new pathways for efficient model development in UAV applications. This work establishes that self-supervised pretraining serves as an effective regularizer for remote sensing tasks. Pretraining limits overfitting and improves generalization across varying environmental conditions, whilst requiring only modest computational resources, making advanced Deep Learning techniques more accessible for practical UAV applications.
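
FastSiam's departure from SimSiam is that the prediction target is the average over several augmented views, which stabilizes small-batch training. The sketch below shows only that loss shape, with placeholder encoders and random tensors standing in for the augmented multispectral views.

```python
# Sketch of the FastSiam target: predict one view's projection against
# the stop-gradient *mean* of several other views. Encoders are toys.
import torch
import torch.nn as nn

encode = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
predict = nn.Linear(128, 128)

views = [torch.randn(8, 3, 64, 64) for _ in range(4)]   # 4 augmented views
zs = [encode(v) for v in views]
target = torch.stack(zs[1:]).mean(dim=0).detach()       # averaged, stop-grad target
loss = -nn.functional.cosine_similarity(predict(zs[0]), target, dim=-1).mean()
```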

  • Research Article
  • 10.1007/s00292-025-01429-7
Foundation models in pathology
  • Apr 24, 2025
  • Pathologie (Heidelberg, Germany)
  • Frederick Klauschen + 2 more

Foundation models prepare neural networks for applications in specific domains, such as speech applications or image analysis, through self-supervised pretraining. These models can be adapted for specific applications, such as histopathological diagnostics. While adaptation still requires supervised training, AI applications based on foundation models achieve significantly better prediction accuracy with fewer training data compared to conventional approaches. This article introduces the topic and provides an overview of foundation models in pathology.

  • Research Article
  • Cited by 19
  • 10.1109/tnnls.2025.3554755
Hypergraph Foundation Model for Brain Disease Diagnosis.
  • Oct 1, 2025
  • IEEE transactions on neural networks and learning systems
  • Xiangmin Han + 6 more

The goal of the hypergraph foundation model (HGFM) is to learn an encoder based on the hypergraph computational paradigm through self-supervised pretraining on high-order correlation structures, enabling the encoder to rapidly adapt to various downstream tasks in scenarios where no labeled data, or only a small amount, is available. Initial exploratory work has applied this paradigm to brain disease diagnosis tasks. However, existing methods primarily rely on graph-based approaches to learn low-order correlation patterns between brain regions in brain networks, neglecting the modeling and learning of complex correlations between different brain diseases and patients. This article proposes an HGFM for brain disease diagnosis, which conducts multidimensional pretraining tasks to explore latent cross-dimensional high-order correlation patterns on various brain disease datasets. HGFM is a high-order correlation-driven foundation model for brain disease diagnosis and effectively improves prediction performance. Specifically, HGFM first performs brain functional network link prediction tasks on individual brain networks and group interaction network link prediction tasks on group brain networks, constructing an HGFM for brain disease diagnosis. In downstream tasks, it achieves predictions for different brain disease diagnosis tasks through few-shot learning fine-tuning methods. The proposed method is evaluated on functional magnetic resonance imaging (fMRI) data from 4,409 patients across four brain diseases. Results show that it outperforms existing state-of-the-art methods in all brain disease diagnosis tasks, demonstrating its potential value in clinical applications.
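
As a generic illustration of the hypergraph computational paradigm such an encoder builds on (not HGFM's actual architecture or pretraining tasks), the sketch below runs one HGNN-style propagation step over an incidence matrix, with nodes standing in for brain regions; all sizes are toys.

```python
# One HGNN-style hypergraph convolution: node -> hyperedge -> node
# propagation with degree normalization. Purely illustrative.
import torch
import torch.nn as nn

N, E, F = 6, 3, 8                          # nodes (regions), hyperedges, features
H = (torch.rand(N, E) > 0.5).float()       # incidence: node i belongs to hyperedge j
X = torch.randn(N, F)
theta = nn.Linear(F, F, bias=False)

Dv = torch.diag(H.sum(1).clamp(min=1) ** -0.5)   # node-degree normalization
De = torch.diag(H.sum(0).clamp(min=1) ** -1)     # hyperedge-degree normalization
X_out = Dv @ H @ De @ H.T @ Dv @ theta(X)        # propagated node features
```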

  • Research Article
  • 10.1101/2025.08.21.25334170
Comparison of Foundation and Supervised Learning-Based Models for Detection of Referable Glaucoma from Fundus Photographs
  • Aug 24, 2025
  • medRxiv
  • Kyle Bolo + 10 more

Purpose: To compare the performance of a foundation model and a supervised learning-based model for detecting referable glaucoma from fundus photographs. Design: Evaluation of diagnostic technology. Participants: 6,116 participants from the Los Angeles County Department of Health Services Teleretinal Screening Program. Methods: Fundus photographs were labeled for referable glaucoma (cup-to-disc ratio ≥ 0.6) by certified optometrists. Four deep learning models were trained on cropped and uncropped images (Training N = 8,996; Validation N = 3,002) using two architectures: a vision transformer with self-supervised pretraining on fundus photographs (RETFound) and a convolutional neural network (VGG-19). Models were evaluated on a held-out test set (N = 1,000) labeled by glaucoma specialists and an external test set (N = 300) from University of Southern California clinics. Performance was assessed while varying training set size and stratifying by demographic factors. xRAI was used for saliency mapping. Main Outcome Measures: Area under the receiver operating characteristic curve (AUC-ROC) and threshold-specific metrics. Results: The cropped image VGG-19 model achieved the highest AUC-ROC (0.924 [0.907–0.940]), which was comparable (p = 0.07) to the cropped image RETFound model (0.911 [0.892–0.930]), which achieved the highest Youden-optimal performance (sensitivity 82.6%, specificity 88.2%) and F1 score (0.801). Cropped image models outperformed their uncropped counterparts within each architecture (p < 0.001 for AUC-ROC comparisons). RETFound models had a performance advantage when trained on smaller datasets (N < 2000 images), and the uncropped image RETFound model performed best on external data (p < 0.001 for AUC-ROC comparisons). The cropped image RETFound model performed consistently across ethnic groups (p = 0.20), while the others did not (p < 0.04); performance did not vary by age or gender. Saliency maps for both architectures consistently included the optic nerve. Conclusion: While both RETFound and VGG-19 models performed well for classification of referable glaucoma, foundation models may be preferable when training data is limited and when domain shift is expected. Training models using images cropped to the region of the optic nerve improves performance regardless of architecture but may reduce model generalizability.
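
The statistical comparison reported here (AUC-ROC differences assessed by bootstrapping) can be sketched as below with synthetic scores; the resampling loop and two-sided p-value are a generic recipe, not the study's code.

```python
# Bootstrap comparison of two models' AUC-ROC on synthetic scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                       # synthetic binary labels
score_a = y + rng.normal(0, 0.8, 1000)             # synthetic model A scores
score_b = y + rng.normal(0, 1.0, 1000)             # synthetic model B scores

diffs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), len(y))          # resample test set with replacement
    if len(np.unique(y[idx])) < 2:
        continue                                   # AUC undefined on one-class samples
    diffs.append(roc_auc_score(y[idx], score_a[idx])
                 - roc_auc_score(y[idx], score_b[idx]))
diffs = np.array(diffs)
p_two_sided = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
```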

  • Research Article
  • 10.1161/circ.152.suppl_3.4370718
Abstract 4370718: Transformer-based ECG beat foundation model reconstructs full 12-Lead morphology, vectorcardiogram and predicts peak heart rate in stress ECG
  • Nov 4, 2025
  • Circulation
  • Sabyasachi Bandyopadhyay + 10 more

Background: Regular monitoring of performance in ECG stress tests can enable early detection of subtle conduction/morphological changes and enable more accurate risk stratification. However, repeated stress ECGs are impossible in high-risk patients, including those with severe stenosis, recent surgery, or significant arrhythmia burden. An ECG foundation model capable of reconstructing 12-lead ECGs from a single lead typically available from wearables (e.g., Apple Watch: lead I) can create ambulatory stress ECG tests which obviate this problem. Hypothesis: We hypothesized that a self-supervised transformer model pretrained on reconstructing 11 masked leads using lead I can learn latent features for predicting peak heart rate (HR) across exercise stages and synthesize vectorcardiograms (VCG) for risk stratification in stress ECGs. Methods: We collected 7,625 stress test records from a single institution, from which 7,453 samples were included. This was divided into 4,447 training, 759 validation, and 2,247 test ECGs, which were used to develop a 6-layer transformer encoder architecture. A transposed-convolutional decoder with skip connections was used to reconstruct the masked leads, while auxiliary linear layers regressed on VCG obtained using the Dower transform and on peak HR. A contrastive regularization loss was used to organize the latent space by reducing the distance between beats belonging to the same patient. The model was first trained solely on the reconstruction task (self-supervised pretraining) for 20 epochs, after which the decoder was frozen and the encoder plus auxiliary heads were supervised fine-tuned for 60 epochs to learn peak HR and VCG reconstructions. Training was performed with batch size = 32 and learning rate = 3×10⁻³ during pretraining, followed by 3×10⁻⁴ during fine-tuning. Results: The model achieved A) a reconstruction mean squared error (MSE) of 0.16 mV² on the masked leads, B) an R of 0.73 on peak HR regression, with AUC = 0.82 and AUPRC = 0.9 on high (> 120 bpm) peak HR classification, and C) Pearson R of 0.96, 0.95, and 0.98 on the x, y, and z axes of the VCG in the held-out test dataset (Fig. 1). Conclusion: We are able to faithfully reconstruct 12-lead beat morphology from lead I, which was valid across ST segments, QRS complexes, and PR intervals. This self-supervised pretraining step was applicable in creating ambulatory, morphology-aware stress ECG indices for a large hold-out test set.
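
Stripped of the transformer, skip connections, and contrastive regularizer, the pretraining objective is lead-masked reconstruction: predict the 11 hidden leads from lead I. The sketch below shows only that objective, with an arbitrary beat length and a toy network in place of the authors' architecture.

```python
# Sketch of the pretraining objective: reconstruct 11 masked ECG leads
# from lead I alone. Beat length and network are placeholders.
import torch
import torch.nn as nn

T = 256                                            # samples per beat (illustrative)
net = nn.Sequential(nn.Linear(T, 512), nn.GELU(), nn.Linear(512, 11 * T))

ecg = torch.randn(8, 12, T)                        # 12-lead beats
lead_i, masked_leads = ecg[:, 0], ecg[:, 1:]       # input vs. reconstruction targets
recon = net(lead_i).view(8, 11, T)
mse = nn.functional.mse_loss(recon, masked_leads)  # reconstruction loss (masked leads)
```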

  • Research Article
  • Cited by 3
  • 10.1007/s44267-025-00085-y
DASFormer: self-supervised pretraining for earthquake monitoring.
  • Jul 15, 2025
  • Visual intelligence
  • Qianggang Ding + 3 more

Earthquake monitoring is a fundamental task to unravel the underlying physics of earthquakes and mitigate associated hazards for public safety. Distributed acoustic sensing, or DAS, which transforms pre-existing telecommunication cables into ultra-dense seismic networks, offers a cost-effective and scalable solution for next-generation earthquake monitoring. However, current approaches for earthquake monitoring like PhaseNet and PhaseNet-2 primarily rely on supervised learning, while manually labeled DAS data is quite limited and it is difficult to obtain more annotated datasets. In this paper, we present DASFormer, a novel self-supervised pretraining technique on DAS data with a coarse-to-fine framework that models spatial-temporal signal correlation. We treat earthquake monitoring as an anomaly detection task and demonstrate DASFormer can be directly utilized as a seismic phase detector. Experimental results demonstrate that DASFormer is effective in terms of several evaluation metrics and outperforms state-of-the-art time-series forecasting, anomaly detection, and foundation models on the unsupervised seismic detection task. We also demonstrate the potential of fine-tuning DASFormer to downstream tasks through case studies.

  • Research Article
  • 10.1016/j.xops.2025.101008
Comparison of RETFound and a Supervised Convolutional Neural Network for Detection of Referable Glaucoma from Fundus Photographs.
  • Feb 1, 2026
  • Ophthalmology science
  • Kyle Bolo + 10 more

To compare the performance of a vision transformer-based foundation model (RETFound) and a supervised convolutional neural network (VGG-19) for detecting referable glaucoma from fundus photographs. An evaluation of diagnostic technology. Six thousand one hundred sixteen participants from the Los Angeles County Department of Health Services Teleretinal Screening Program. Fundus photographs were labeled for referable glaucoma (cup-to-disc ratio ≥0.6) by certified optometrists. Four deep learning models were trained on cropped and uncropped images (training N = 8996; validation N = 3002) using 2 architectures: RETFound, a vision transformer with self-supervised pretraining on fundus photographs, and VGG-19. Models were evaluated on a held-out test set (N = 1000) labeled by glaucoma specialists and an external test set (N = 300) from University of Southern California clinics. Performance was assessed while varying training set size and stratifying by demographic factors. xRAI was used for saliency mapping. Area under the receiver operating characteristic curve (AUC-ROC) and threshold-specific metrics. The cropped image VGG-19 model achieved the highest AUC-ROC (0.924 [0.907-0.940]), which was comparable (P = 0.07) to the cropped image RETFound model (0.911 [0.892-0.930]), which achieved the highest Youden-optimal performance (sensitivity 82.6% and specificity 88.2%) and F1 score (0.801). Cropped image models outperformed their uncropped counterparts (RETFound 0.889 [0.868-0.909], VGG-19 0.898 [0.879-0.917]) within each architecture (P < 0.001 for AUC-ROC comparisons). The uncropped image RETFound model performed best on external data (0.886 [0.849-0.924] vs. the next-highest 0.797 [0.746-0.848], P < 0.001 for AUC-ROC comparisons). RETFound models had a performance advantage when trained on smaller datasets (N < 2000 images), and the cropped image RETFound model performed consistently across ethnic groups (P = 0.20), whereas the others did not (P < 0.04). Performance did not vary by age or gender. Saliency maps for both architectures consistently included the optic nerve. Although both RETFound and VGG-19 models performed well for classification of referable glaucoma, foundation models may be preferable when training data are limited and when domain shift is expected. Training models using images cropped to the region of the optic nerve improves performance regardless of architecture but may reduce model generalizability. Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

  • Book Chapter
  • Cited by 3
  • 10.1007/978-3-030-89698-0_102
Self-supervised Pretraining for Covid-19 and Other Pneumonia Detection from Chest X-ray Images
  • Jan 1, 2022
  • Yulong Hao + 2 more

Artificial intelligence technology has made breakthroughs in computer vision and natural language processing in recent years. An important factor is that the technology analyzes tasks in a data-driven manner and automatically learns data representations from large datasets for a specific task. However, one challenge is the lack of sufficient labelled data for pneumonia detection from chest X-ray images, which usually offers only a small number of identically distributed labelled examples for training and thus conflicts with data-driven deep learning. This is also a bottleneck in the development of medical imaging AI. To address this challenge, we propose a self-supervised pre-training method for Covid-19 and other pneumonia detection. The method includes pre-trained model training and transfer learning. The pre-trained model uses a self-supervised contrastive learning method to learn general representations from source data with location-sensitive patches and multi-level features. Transfer learning includes three stages of training to specialize the representation from source data to target data. The experiments show improved performance for detection of Covid-19 and other pneumonia with few labelled data. Keywords: Self-supervised; Object detection; Contrastive learning; Transfer learning; Covid-19; Chest radiographs
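
The contrastive objective underlying such pretraining can be written compactly as an InfoNCE-style loss over paired views; the paper's location-sensitive patch sampling and multi-level features are not reproduced in the sketch below, and the embedding size and temperature are assumptions.

```python
# InfoNCE-style contrastive loss over paired views (generic sketch).
import torch
import torch.nn.functional as F

z1 = F.normalize(torch.randn(8, 128), dim=1)   # embeddings of view 1
z2 = F.normalize(torch.randn(8, 128), dim=1)   # embeddings of view 2 (positives)
logits = z1 @ z2.T / 0.1                       # cosine similarity / temperature
labels = torch.arange(8)                       # matching pairs sit on the diagonal
loss = F.cross_entropy(logits, labels)
```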

  • Conference Article
  • Cited by 579
  • 10.1109/iccv48922.2021.00346
Big Self-Supervised Models Advance Medical Image Classification
  • Oct 1, 2021
  • Shekoofeh Azizi + 11 more

Self-supervised pretraining followed by supervised fine-tuning has seen success in image recognition, especially when labeled examples are scarce, but has received limited attention in medical image analysis. This paper studies the effectiveness of self-supervised learning as a pre-training strategy for medical image classification. We conduct experiments on two distinct tasks: dermatology condition classification from digital camera images and multi-label chest X-ray classification, and demonstrate that self-supervised learning on ImageNet, followed by additional self-supervised learning on unlabeled domain-specific medical images significantly improves the accuracy of medical image classifiers. We introduce a novel Multi-Instance Contrastive Learning (MICLe) method that uses multiple images of the underlying pathology per patient case, when available, to construct more informative positive pairs for self-supervised learning. Combining our contributions, we achieve an improvement of 6.7% in top-1 accuracy and an improvement of 1.1% in mean AUC on dermatology and chest X-ray classification respectively, outperforming strong supervised baselines pretrained on ImageNet. In addition, we show that big self-supervised models are robust to distribution shift and can learn efficiently with a small number of labeled medical images.
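
MICLe's core move is in the pair construction: positives come from different images of the same patient rather than two augmentations of one image. The pairing sketch below shows that step only (a SimCLR-style loss would follow); the record layout is invented for illustration.

```python
# Sketch of MICLe-style positive-pair construction: pair *different*
# images of the same patient. Record layout is invented for illustration.
import torch
from collections import defaultdict

records = [(pid, torch.randn(3, 224, 224))         # (patient_id, image) records
           for pid in [0, 0, 1, 1, 1, 2, 2]]

by_patient = defaultdict(list)
for pid, img in records:
    by_patient[pid].append(img)

pairs = []                                         # positive pairs across images
for imgs in by_patient.values():
    if len(imgs) >= 2:
        i, j = torch.randperm(len(imgs))[:2].tolist()
        pairs.append((imgs[i], imgs[j]))
```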

  • Research Article
  • 10.3390/diagnostics16030440
Transformer-Based Foundation Learning for Robust and Data-Efficient Skin Disease Imaging.
  • Feb 1, 2026
  • Diagnostics (Basel, Switzerland)
  • Inzamam Mashood Nasir + 3 more

Background/Objectives: Accurate and reliable automated dermoscopic lesion classification remains challenging. This is due to pronounced dataset bias, limited expert-annotated data, and poor cross-dataset generalization of conventional supervised deep learning models. In clinical dermatology, these limitations restrict the deployment of data-driven diagnostic systems across diverse acquisition settings and patient populations. Methods: Motivated by these challenges, this study proposes a transformer-based, dermatology-specific foundation model. The model learns transferable visual representations from large collections of unlabeled dermoscopic images via self-supervised pretraining. It integrates large-scale dermatology-oriented self-supervised learning with a hierarchical vision transformer backbone. This enables effective capture of both fine-grained lesion textures and global morphological patterns. The evaluation is conducted across three publicly available dermoscopic datasets: ISIC 2018, HAM10000, and PH2. The study assesses in-dataset, cross-dataset, limited-label, ablation, and computational-efficiency settings. Results: The proposed approach achieves in-dataset classification accuracies of 94.87%, 97.32%, and 98.17% on ISIC 2018, HAM10000, and PH2, respectively. It outperforms strong transformer and hybrid baselines. Cross-dataset transfer experiments show consistent performance gains of 3.5-5.8% over supervised counterparts. This indicates improved robustness to domain shift. Furthermore, when fine-tuned with only 10% of the labeled training data, the model achieves performance comparable to fully supervised baselines. Conclusions: This highlights strong data efficiency. These results demonstrate that dermatology-specific foundation learning offers a principled and practical solution for robust dermoscopic lesion classification under realistic clinical constraints.
