Abstract

Image quality is important for diagnostic confidence. For teledermatologists, low-quality images cause greater uncertainty, necessitating more cautious decision-making. Likewise, artificial intelligence systems are trained to recognize low-quality images and need to adjust their margins of error to account for the increased uncertainty, avoiding costly mistakes such as missed cancer diagnoses. There is, at present, no publicly available, validated tool for assessing skin lesion image quality.

We describe our steps in developing a grading scale to assess image quality, based on real-world images from an existing teledermatology service. We conducted a literature search for evidence-based data in the field. Through multiple focus group meetings with dermatologists and clinical photographers, we formulated items to evaluate our measurement of interest directly. Three key variables were identified: ‘focus’, ‘lighting’ and ‘composition’. All variables were interdependent, with ‘focus’ being the most essential. We devised a 4-point grading scale, whereby images are visually assessed and assigned a score between grade 1 (low) and grade 4 (very high).

Next, we tested the scale on two cohorts: four dermatologists (clinical experts) and two clinical photographers (technical experts). A sample of 35 anonymized images from teledermatology and clinical photography (CP), evenly distributed across the scale, was used. To reduce variability in user perception, participants received 10 min of presurvey training, followed by online testing under standardized settings. Inter-rater reliability (IRR), or user agreement, was analysed using Fleiss’ kappa calculations. The all-user IRR showed moderate agreement (IRR 0.50), with modest improvement on in-group analysis [clinical photographers (IRR 0.58) and dermatologists (IRR 0.52)]. To estimate the IRR for each grade, we calculated the percentage agreement against a provisional benchmark. All-user percentage agreement was 67%, with the highest agreement for grade 1 (low, 83%) and the lowest for grade 3 (high, 53%). Feedback on content relevance, usability and form was obtained, guiding further scale modification.

We appraised scale discrimination through testing on a representative sample of real-world images from 200 consecutive referrals. Seventy-eight per cent of referrals had primary care-acquired images, and 21% had CP images. On average, primary care-acquired images were graded lower [2.17, 95% confidence interval (CI) 2.04–2.30] than CP images (3.89, 95% CI 3.77–4.00). The response distribution showed that the entire range of the scale was used for its intended purpose. Grades achieved by primary care-acquired images were as follows: grade 1 (27%), grade 2 (30%), grade 3 (42%) and grade 4 (1%). Grades of CP images were as follows: grade 2 (2%), grade 3 (7%) and grade 4 (91%). This suggests a significant quality difference between acquisition modalities.

Despite the inherent subjectivity of image quality assessment, these results provide a good platform for further scale development. The development of robust measurement instruments is a multistep process. The next stage is nonexpert evaluation and further validity and reliability testing.
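For illustration, the reliability analysis described above can be reproduced along the following lines. This is a minimal sketch assuming the statsmodels package; the rating matrix and benchmark grades below are hypothetical stand-ins for demonstration, not the study’s data.

```python
# Minimal sketch of the reliability analysis: Fleiss' kappa across all
# raters, plus percentage agreement against a provisional benchmark.
# Assumes statsmodels; all data below is hypothetical, for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = images, columns = raters (6 in total),
# values = grade on the 4-point scale (1 = low ... 4 = very high).
ratings = np.array([
    [1, 1, 1, 2, 1, 1],
    [3, 2, 3, 3, 4, 3],
    [4, 4, 4, 4, 4, 3],
    [2, 2, 1, 2, 2, 2],
    [3, 3, 2, 3, 3, 4],
])

# Convert the subject-by-rater matrix into subject-by-category counts,
# then compute Fleiss' kappa across all raters.
counts, _ = aggregate_raters(ratings)
print(f"All-user Fleiss' kappa: {fleiss_kappa(counts, method='fleiss'):.2f}")

# Percentage agreement with a (hypothetical) provisional benchmark grade
# per image, overall and broken down by benchmark grade.
benchmark = np.array([1, 3, 4, 2, 3])
agree = ratings == benchmark[:, None]
print(f"All-user percentage agreement: {agree.mean():.0%}")
for grade in range(1, 5):
    mask = benchmark == grade
    if mask.any():
        print(f"Grade {grade} agreement: {agree[mask].mean():.0%}")
```

In-group IRR (e.g. dermatologists only) follows by restricting the rating matrix to the relevant raters’ columns before aggregation.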
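The modality comparison could be computed along similar lines; the sketch below assumes scipy and uses a t-based interval, and the grade lists are again hypothetical.

```python
# Minimal sketch of the modality comparison: mean grade with a t-based
# 95% confidence interval per acquisition modality. Assumes scipy;
# the grade lists are hypothetical, for illustration.
import numpy as np
from scipy import stats

def mean_with_ci(grades, confidence=0.95):
    """Return the mean grade and a t-based confidence interval."""
    grades = np.asarray(grades, dtype=float)
    mean, sem = grades.mean(), stats.sem(grades)
    lo, hi = stats.t.interval(confidence, df=len(grades) - 1,
                              loc=mean, scale=sem)
    return mean, lo, hi

# Hypothetical per-image grades for each acquisition modality.
modalities = {
    "Primary care": [2, 1, 3, 2, 3, 2, 1, 3, 2, 2],
    "Clinical photography (CP)": [4, 4, 4, 3, 4, 4, 4, 4],
}
for name, grades in modalities.items():
    mean, lo, hi = mean_with_ci(grades)
    print(f"{name}: mean grade {mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```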
