Background: While skin cancers are less prevalent in people with skin of color, they are more often diagnosed at later stages and have a poorer prognosis. The use of artificial intelligence (AI) models can potentially improve early detection of skin cancers; however, the lack of skin color diversity in training datasets may only widen the pre-existing racial discrepancies in dermatology.
Objective: The aim of this study was to systematically review the technique, quality, accuracy, and implications of studies using AI models trained or tested in populations with skin of color for classification of pigmented skin lesions.
Methods: PubMed was used to identify any studies describing AI models for classification of pigmented skin lesions. Only studies that used training datasets with at least 10% of images from people with skin of color were eligible. Outcomes on study population, design of AI model, accuracy, and quality of the studies were reviewed.
Results: Twenty-two eligible articles were identified. Most studies used training datasets obtained from Chinese (7/22), Korean (5/22), and Japanese (3/22) populations. Seven studies used diverse datasets containing Fitzpatrick skin type I–III in combination with at least 10% from Black Americans, Native Americans, Pacific Islanders, or Fitzpatrick IV–VI. AI models producing binary outcomes (e.g., benign vs. malignant) reported an accuracy ranging from 70% to 99.7%. Accuracy of AI models reporting multiclass outcomes (e.g., specific lesion diagnosis) was lower, ranging from 43% to 93%. Reader studies, in which dermatologists' classifications are compared with AI model outcomes, reported similar accuracy in one study, higher AI accuracy in three studies, and higher clinician accuracy in two studies. A quality review revealed that dataset description and variety, benchmarking, public evaluation, and healthcare application were frequently not addressed.
Conclusions: While this review provides promising evidence of accurate AI models in populations with skin of color, the majority of the studies reviewed used data from East Asian populations and therefore provide insufficient evidence to comment on the overall accuracy of AI models for darker skin types. Large discrepancies remain between the number of AI models developed in populations with skin of color (particularly Fitzpatrick type IV–VI) and the number developed in populations of largely European ancestry. A lack of publicly available datasets from diverse populations is likely a contributing factor, as is the inadequate reporting of patient-level metadata relating to skin color in training datasets.