Background: Convolutional neural networks (CNNs) are regarded as state-of-the-art artificial intelligence (AI) tools for dermatological diagnosis, and they have been shown to achieve expert-level performance when trained on a representative dataset. CNN explainability is a key factor in adopting such techniques in practice and can be achieved using attention maps of the network. However, evaluation of CNN explainability has so far been limited to visual assessment, which is qualitative, subjective, and time-consuming.

Objective: This study aimed to provide a framework for the objective, quantitative assessment of the explainability of CNNs on dermatological diagnosis benchmarks.

Methods: We sourced 566 images available under a Creative Commons license from two public datasets, DermNet NZ and SD-260, with reference diagnoses of acne, actinic keratosis, psoriasis, seborrheic dermatitis, viral warts, and vitiligo. Eight dermatologists with teledermatology expertise annotated each clinical image with a diagnosis, as well as diagnosis-supporting characteristics and their localization. A total of 16 supporting visual characteristics were selected, including basic terms such as macule, nodule, papule, patch, plaque, pustule, and scale, and additional terms such as closed comedo, cyst, dermatoglyphic disruption, leukotrichia, open comedo, scar, sun damage, telangiectasia, and thrombosed capillary. The resulting dataset consisted of 525 images, each with annotations from three raters. The explainability of two fine-tuned CNN models, ResNet-50 and EfficientNet-B4, was analyzed with respect to the reference explanations provided by the dermatologists. Both models were pretrained on the ImageNet natural image recognition dataset and fine-tuned using 3214 images of the six target skin conditions obtained from an internal clinical dataset. CNN explanations were obtained as activation maps of the models using gradient-weighted class activation mapping (Grad-CAM).
We computed the fuzzy sensitivity and specificity of each characteristic attention map with regard to both the fuzzy gold standard characteristic attention fusion masks and the fuzzy union of all characteristics.

Results: On average, the explainability of EfficientNet-B4 was higher than that of ResNet-50 in terms of sensitivity for 13 of 16 supporting characteristics, with mean values of 0.24 (SD 0.07) and 0.16 (SD 0.05), respectively. However, its explainability was lower in terms of specificity, with mean values of 0.82 (SD 0.03) and 0.90 (SD 0.00) for EfficientNet-B4 and ResNet-50, respectively. All measures were within the range of the corresponding interrater metrics.

Conclusions: We objectively benchmarked the explainability of dermatological diagnosis models using expert-defined supporting characteristics for diagnosis.

Acknowledgments: This work was supported in part by the Danish Innovation Fund under Grant 0153-00154A.

Conflict of Interest: None declared.
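
The fuzzy sensitivity and specificity mentioned in the Methods can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: maps are 2D grids with values in [0, 1], fuzzy intersection is the pixel-wise minimum, the rater masks are fused by a pixel-wise mean, and the fuzzy union of characteristics is the pixel-wise maximum. The function names (`fuse_raters`, `fuzzy_union`, `fuzzy_sensitivity`, `fuzzy_specificity`) are hypothetical and do not come from the study.

```python
from itertools import chain
from typing import List, Sequence

# A 2D map (rows of pixels) with values in [0, 1]; assumed representation.
Map = Sequence[Sequence[float]]


def _flat(m: Map) -> List[float]:
    """Flatten a 2D map into a single list of pixel values."""
    return list(chain.from_iterable(m))


def fuse_raters(masks: Sequence[Map]) -> List[List[float]]:
    """Fuse rater masks into one fuzzy mask by pixel-wise mean (assumed rule)."""
    n = len(masks)
    return [[sum(px) / n for px in zip(*rows)] for rows in zip(*masks)]


def fuzzy_union(maps: Sequence[Map]) -> List[List[float]]:
    """Fuzzy union of several maps as the pixel-wise maximum (assumed rule)."""
    return [[max(px) for px in zip(*rows)] for rows in zip(*maps)]


def fuzzy_sensitivity(pred: Map, ref: Map) -> float:
    """Share of the reference mass that the predicted map also covers."""
    p, r = _flat(pred), _flat(ref)
    total = sum(r)
    if total == 0:
        return float("nan")  # undefined when the reference is empty
    return sum(min(a, b) for a, b in zip(p, r)) / total


def fuzzy_specificity(pred: Map, ref: Map) -> float:
    """Share of the reference background that the predicted map also leaves out."""
    p, r = _flat(pred), _flat(ref)
    total = sum(1 - b for b in r)
    if total == 0:
        return float("nan")  # undefined when the reference covers everything
    return sum(min(1 - a, 1 - b) for a, b in zip(p, r)) / total


# Toy 2x2 example: a model attention map against a fused reference mask.
attention = [[1.0, 0.5], [0.0, 0.0]]
reference = [[1.0, 1.0], [0.0, 0.5]]
print(fuzzy_sensitivity(attention, reference))
print(fuzzy_specificity(attention, reference))
```

Under these assumptions, a map that highlights only part of the annotated region scores low on sensitivity but high on specificity, which is the kind of trade-off reported in the Results for the two models.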