Abstract

The application of deep learning (DL) to diagnostic dermatology has been the subject of numerous studies, with some reporting skin lesion classification performance on curated datasets comparable to that of experienced dermatologists. However, most skin disease images encountered in clinical settings are macroscopic, lack dermoscopic information, and exhibit considerable variability, so further research is necessary to determine how well DL algorithms generalize across populations and acquisition settings.

This study aimed to assess the extent to which DL can generalize to nondermoscopic datasets acquired at the primary-secondary care interface in the National Health Service (NHS); to explore how to obtain clinically satisfactory performance on nonstandardized, real-world local data without large, diagnostically labelled local datasets; and to measure the impact of pretraining DL algorithms on external, public datasets.

Diagnostic macroscopic image datasets were created from previous referrals from primary to secondary care: 2213 images referred by primary care practitioners in NHS Tayside and 1510 images from NHS Forth Valley acquired by medical photographers. Two further datasets with identical diagnostic labels were obtained from public-domain sources, namely the International Skin Imaging Collaboration (ISIC) dermoscopic dataset and the SD-260 nondermoscopic dataset. DL algorithms, specifically EfficientNets and Shifted Window (Swin) transformers, were trained on each of these datasets. Algorithms were also fine-tuned on images from the NHS datasets after pretraining on different data combinations, including the larger public-domain datasets. Performance was assessed using receiver operating characteristic (ROC) curves and the area under the curve (AUC).

Swin transformers tested on Forth Valley data achieved AUCs of 0.85 and 0.89 when trained on SD-260 and Forth Valley data, respectively. Training on SD-260 followed by fine-tuning on Forth Valley data gave an AUC of 0.91. Similar effects of pretraining and fine-tuning on local data were observed using Tayside data and EfficientNets. Pretraining on the larger dermoscopic image dataset (ISIC 2019) provided no additional benefit.

Pretraining on public macroscopic images followed by fine-tuning on local data gave promising results, but further improvement is needed before deployment in real clinical pathways. Larger datasets local to the target domain would be expected to yield further gains in performance.
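The pretrain-then-fine-tune schedule described above can be sketched in miniature. The code below is illustrative only: it substitutes a toy NumPy logistic-regression "model" for an EfficientNet or Swin backbone, and the synthetic "public" and "local" datasets merely stand in for SD-260 and the Forth Valley images; none of the names, sizes, or hyperparameters come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, w=None, lr=0.1, epochs=200):
    """Plain gradient-descent logistic regression. Passing `w` warm-starts
    training from pretrained weights, which is the essence of fine-tuning."""
    n, d = X.shape
    if w is None:
        w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w -= lr * grad
    return w

# Synthetic "public" source data (standing in for SD-260) and a smaller
# "local" target set (standing in for the Forth Valley images), drawn from
# related but shifted distributions.
d = 20
w_true = rng.normal(size=d)
X_pub = rng.normal(size=(2000, d))
y_pub = (X_pub @ w_true > 0).astype(float)
X_loc = rng.normal(size=(200, d)) + 0.3        # mild domain shift
y_loc = (X_loc @ w_true > 0).astype(float)

w_pre = train(X_pub, y_pub)                    # pretrain on public data
w_ft = train(X_loc, y_loc, w=w_pre.copy(),     # fine-tune on local data
             lr=0.01, epochs=50)               # with a lower learning rate
```

The lower learning rate during the fine-tuning stage is the conventional choice when adapting pretrained weights to a small target dataset, so that the local data nudges rather than overwrites what was learned from the larger source dataset.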
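Performance above is reported as the area under the ROC curve (AUC), which equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal pure-Python implementation via this pairwise (Mann-Whitney) formulation, written here only to make the metric concrete:

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as half.
    O(n^2), which is fine for a small illustration."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a sense of scale for the 0.85–0.91 values reported above.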
