P206 Artificial Intelligence reproduces central reading for automated scoring of colonoscopies from multiple clinical trials in Ulcerative Colitis

B Gutierrez Becker,A Bigorgne,E Fisher,D Richmond,J Luscher,J Arús-Pous,M Prunotto,D Bojic,H Yao

doi:10.1093/ecco-jcc/jjac190.0336

Abstract

Background

Artificial Intelligence (AI) has been used successfully in several models for the automated scoring of colonoscopy videos with the Mayo Clinic Endoscopic Subscore (MCES) [1]. However, no study to date has evaluated the performance of such algorithms on a new cohort obtained from a separate clinical trial. In this study, we aimed to build an AI scoring algorithm able to reproduce central reading from three Phase III clinical trials in Ulcerative Colitis (UC).

Methods

As a training dataset, still frames were automatically extracted from sigmoidoscopy videos of two Etrolizumab clinical trials: Hickory NCT02100696 [2] and Laurel NCT02165215 [3]. A quality control AI algorithm [4] was applied, and the resulting dataset (N=1,897 videos) was paired with the central reading MCES for each colon section. A multiclass MCES scoring algorithm [4] was trained on this dataset for the more challenging task of assigning a multiclass score (MCES 0 to 3) rather than a binary score (MCES ≤ 1 or not). Performance was assessed on 2,292 videos from independent cohorts (Hibiscus I NCT02163759, Hibiscus II NCT02171429 [5], Gardenia NCT02136069 [6]), with N=637, 636 and 1,019 videos for Hibiscus I, Hibiscus II and Gardenia, respectively. These videos were pre-processed in the same manner as those from the training dataset. This dataset constitutes the largest cohort to date for the evaluation of an MCES scoring algorithm.

Results

The Area Under the Receiver Operating Characteristic curve (AUROC) was evaluated independently for each trial. Table 1 summarizes AUROC values per MCES for each clinical trial. The resulting mean AUROC values were 0.79 for Hibiscus I and 0.77 for both Hibiscus II and Gardenia. The average Cohen's kappa between the automatic measurements and the central readings was 0.40, compared with 0.41 between central and local readers.

Conclusion

Our automated MCES scoring algorithm efficiently reproduces central reading in three independent clinical trial cohorts. In addition, the algorithm was evaluated on data from independent trials, different from the ones used to develop it. This is the first study to evaluate AI-based automated MCES scoring in UC on such an extended cohort.
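
The two metrics reported in the Results, AUROC per MCES and Cohen's kappa against central reading, can be illustrated with a minimal Python sketch. The abstract does not state how the per-class AUROC was computed or whether the kappa was weighted, so the sketch assumes a one-vs-rest AUROC and an unweighted kappa. The function names and the eight-video toy data are hypothetical and are not taken from the trials.

```python
from collections import Counter


def auroc_one_vs_rest(labels, scores, positive):
    """AUROC for one class against all others.

    Computed as the probability that a randomly chosen positive video
    receives a higher score than a randomly chosen negative one
    (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two sets of categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: central-reading MCES and model class probabilities
# (one row per video, one column per MCES value 0..3).
central = [0, 0, 1, 1, 2, 2, 3, 3]
probs = [
    [0.70, 0.20, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.20, 0.50, 0.25],
    [0.05, 0.10, 0.35, 0.50],
    [0.05, 0.10, 0.40, 0.45],
    [0.00, 0.05, 0.15, 0.80],
]

# One AUROC per MCES value, then their unweighted mean.
per_class = {
    mces: auroc_one_vs_rest(central, [row[mces] for row in probs], mces)
    for mces in range(4)
}
mean_auroc = sum(per_class.values()) / len(per_class)

# Predicted MCES is the most probable class; compare it with central reading.
predicted = [max(range(4), key=row.__getitem__) for row in probs]
kappa = cohen_kappa(central, predicted)

print(per_class)
print(mean_auroc)
print(kappa)
```

Applied to a real evaluation, the same two functions would be run separately on each trial's videos, since the abstract reports AUROC independently per trial. If the original analysis used a weighted kappa, the chance-agreement term would need disagreement weights, which this sketch omits.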
