298 Improving AI Assessment of Cutaneous Chronic Graft-Versus-Host Disease using Unlabeled Patient Photographs

Andrew McNeil, Kesley Parks, Edward Cowen, Julia Lehman, Dominique Pichard, Michi Shinohara, Mary Flowers, Benoit Dawant, Eric Tkaczyk

doi:10.1017/cts.2024.272

Abstract

OBJECTIVES/GOALS: Measuring the area of skin involvement in chronic graft-versus-host disease (cGVHD) relies on costly, time-consuming manual assessment, with high disagreement among experts (>20%). Our published AI method, trained on labeled 3D photos, showed promise for delineating affected areas. We aim to improve its performance using unlabeled 2D photos. METHODS/STUDY POPULATION: Our published AI model (baseline) was trained on 360 labeled photos of 36 cGVHD patients, taken with a 3D camera at calibrated distance and lighting. Our gold standard labels were contours around affected skin, marked by a trained expert. A second, unlabeled cohort of 974 standard 2D photos of 8 cGVHD patients was used to improve the baseline model. First, the baseline model predicted affected areas on the unlabeled photos. Photos with good predictions were added to the training set with their AI-predicted labels. The model was then re-trained on the expanded labeled set. Models were successively trained with more AI labels until performance stopped improving. AI performance was assessed on a test set of 20 photos from 20 patients unseen during training, labeled by 4 experts to improve accuracy. RESULTS/ANTICIPATED RESULTS: Model performance was calculated by comparing predictions against the gold standard labels on the test set. To quantify the spatial overlap of labeled areas, we used the Dice coefficient (0 is no overlap, 1 is complete agreement), where higher values are better. To estimate clinical error, we used surface area error (Error), where lower values are better. On the test set, the baseline model had a median Dice of 0.57 [interquartile range: 0.39 – 0.82] and Error of 57.6% [20.2 – 103.3%]. After re-training with additional AI-predicted labels from 8 new patients, the model yielded a median Dice of 0.60 [0.35 – 0.80] and Error of 50% [12.5 – 103.8%]. This approach is being expanded to a further 300 unlabeled patients, where we anticipate significant improvements to AI performance and consistency. DISCUSSION/SIGNIFICANCE: Evaluating AI models in standard photos could provide a consistent method of assessing and tracking cutaneous cGVHD and relieve the burden of costly expert assessment. A reliable automated AI tool would provide a meaningful improvement to the current standard of manual assessment and could be easily applied to large patient cohorts.
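
The two metrics reported above can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, masks are assumed to be binary grids of equal shape (1 = affected skin), and because the abstract does not define surface area error, it is assumed here to be the absolute difference in labeled area relative to the gold standard area, which is consistent with reported values above 100%.

```python
def _flatten(mask):
    """Flatten a 2D binary mask (list of rows) into a list of booleans."""
    return [bool(value) for row in mask for value in row]


def dice_coefficient(pred, truth):
    """Dice = 2|A and B| / (|A| + |B|); 0 is no overlap, 1 is complete agreement."""
    pred_flat, truth_flat = _flatten(pred), _flatten(truth)
    if len(pred_flat) != len(truth_flat):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred_flat, truth_flat))
    total = sum(pred_flat) + sum(truth_flat)
    if total == 0:
        # Assumed convention: two empty masks agree completely.
        return 1.0
    return 2.0 * intersection / total


def area_error_percent(pred, truth):
    """ASSUMED definition: |predicted area - true area| / true area, in percent."""
    pred_area = sum(_flatten(pred))
    truth_area = sum(_flatten(truth))
    if truth_area == 0:
        raise ValueError("relative area error is undefined for an empty gold standard")
    return 100.0 * abs(pred_area - truth_area) / truth_area


# Toy 3x3 example: both masks cover 4 pixels, but only 3 of them coincide.
truth = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
pred = [[1, 1, 1],
        [1, 0, 0],
        [0, 0, 0]]

print(dice_coefficient(pred, truth))    # 0.75
print(area_error_percent(pred, truth))  # 0.0
```

In the toy example the area error is zero even though the overlap is imperfect, which suggests why reporting both metrics is informative: Dice reflects where the labeled area lies, while area error reflects only how much skin is labeled.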
