Abstract

Background
Regulatory guidance recommends the endoscopy subscore, a component of the modified Mayo Score, as a primary endpoint in ulcerative colitis (UC) therapeutic trials. The endoscopy subscore is typically assessed via the 2 + 1 central reading paradigm, a workflow in which multiple readers independently review each video and disagreements are most commonly resolved by statistical methods [1]. Disagreement in centrally read endoscopic assessments can impact the reliability and reproducibility of trial results [2]. Machine learning (ML) provides an opportunity for standardization. The objective of this study is to evaluate a novel ML model for assessing the endoscopy subscore in UC trials against a 2 + 1 reference standard.

Methods
Endoscopic videos from the mirikizumab Phase 2 (NCT02589665) and Phase 3 induction (NCT03518086) trials in UC were added to a database of endoscopic recordings from routine practice to create a cohort of 18,169 videos. A total of 639 videos (~25%) from the Phase 3 induction trial (week 0 and week 12 procedures) with a per-protocol 2 + 1 centrally read endoscopy subscore were randomly selected, with a distribution of endoscopic severity similar to that of the overall study population (82.2% moderate-to-severe), and held out to evaluate the performance of the final, locked model. The remaining videos were used to develop a state-of-the-art multi-stage deep learning algorithm that assesses the endoscopy subscore on full-length UC endoscopic videos. Quadratic weighted kappa (QWK) was used to evaluate the inter-rater agreement between the model-assessed endoscopy subscore and the 2 + 1 reference standard.

Results
In the holdout test cohort (n=639), the agreement rate between the first two human readers (local vs central reader 1) in determining the centrally read endoscopy subscore was 62.4%, with disagreement by 2 classes in 2.5% of cases, in line with published data [3,4].
QWK between the model-assessed endoscopy subscore and the 2 + 1 reference standard was 0.77 (95% confidence interval [CI] 0.74-0.80). Disagreement by 2 classes occurred in 1.9% of videos. Binary accuracy for inactive-to-mild vs moderate-to-severe disease and for inactive vs mild-to-moderate-to-severe disease, the definitions of endoscopic improvement and remission in trials, was 89.8% (95% CI 87.5-92.2%) and 94.2% (95% CI 92.4-96.0%), respectively.

Conclusion
This ML algorithm effectively assesses the endoscopy subscore on full-length endoscopic videos in UC. Given this performance, ML models such as this one could potentially standardize the assessment of inflammation, addressing a notable challenge that inconsistencies among human readers currently pose for UC therapeutic development programs. Future research will investigate ML-based reading paradigms for the assessment of endoscopic endpoints in trials.

References
1. Food and Drug Administration. Ulcerative Colitis: Developing Drugs for Treatment [Internet]. 2022 [cited 2024 Oct 17]. Available from: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/ulcerative-colitis-developing-drugs-treatment
2. Wils, P., Jairath, V., Sands, B. E., Magro, F., Reinisch, W., Rubin, D., . . . & Peyrin-Biroulet, L. (2023). Comparison of treatment effect between phase 2 and phase 3 trials in patients with inflammatory bowel disease. United European Gastroenterology Journal, 11(8), 797-806.
3. Hashash, J. G., Yu Ci Ng, F., Farraye, F. A., Wang, Y., Colucci, D. R., Baxi, S., . . . & Melmed, G. Y. (2024). Inter- and intraobserver variability on endoscopic scoring systems in Crohn’s disease and ulcerative colitis: A systematic review and meta-analysis. Inflammatory Bowel Diseases, izae051.
4. Feagan, B. G., Khanna, R., Sandborn, W. J., Vermeire, S., Reinisch, W., Su, C., . . . & Sands, B. E. (2021). Agreement between local and central reading of endoscopic disease activity in ulcerative colitis: Results from the tofacitinib OCTAVE trials. Alimentary Pharmacology & Therapeutics, 54(11-12), 1442-1453.
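
Illustrative note
The agreement metrics reported in this abstract (QWK, disagreement by 2 classes, and binary accuracy at the two trial-relevant cut-points) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's analysis code: the endoscopy subscore is coded as an integer from 0 (inactive) to 3 (severe), "disagreement by 2 classes" is read as a difference of at least 2, and the function names and toy scores are hypothetical. The confidence intervals are not reproduced because the abstract does not state how they were computed.

```python
def quadratic_weighted_kappa(reference, predicted, n_classes=4):
    """Cohen's kappa with quadratic weights for two lists of integer ratings."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(reference)
    # Observed confusion matrix: rows = reference score, columns = predicted score.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic penalty: grows with the squared distance between classes.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single class")
    return 1.0 - weighted_observed / weighted_expected


def binary_accuracy(reference, predicted, threshold):
    """Share of videos on the same side of a cut-point (score >= threshold)."""
    hits = sum((r >= threshold) == (p >= threshold)
               for r, p in zip(reference, predicted))
    return hits / len(reference)


def large_disagreement_rate(reference, predicted, min_gap=2):
    """Share of videos whose two scores differ by at least min_gap classes."""
    far = sum(abs(r - p) >= min_gap for r, p in zip(reference, predicted))
    return far / len(reference)


if __name__ == "__main__":
    # Toy scores for illustration only; these are not study data.
    reference = [0, 1, 2, 3, 2, 3, 1, 2]
    model = [0, 1, 2, 3, 3, 2, 1, 0]
    print("QWK:", quadratic_weighted_kappa(reference, model))
    # Threshold 2 separates inactive-to-mild (0-1) from moderate-to-severe (2-3).
    print("Accuracy at >=2:", binary_accuracy(reference, model, threshold=2))
    # Threshold 1 separates inactive (0) from mild-to-severe (1-3).
    print("Accuracy at >=1:", binary_accuracy(reference, model, threshold=1))
    print("Off by >=2 classes:", large_disagreement_rate(reference, model))
```

Quadratic weighting penalizes a disagreement of 2 classes four times as heavily as a disagreement of 1 class, which suits an ordinal scale such as the endoscopy subscore, where near-misses are less consequential than large discrepancies.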