Abstract

Background
Central reading in ulcerative colitis (UC) trials has high interobserver variability, typically managed through 2 + 1 adjudication, which is invoked when the 2 readers disagree on the endoscopic subscore (40% of cases) [1,2]. While artificial intelligence (AI) models show high agreement with human readers [3], relying solely on AI in medical imaging raises concerns. Our novel 2M+1H approach uses human adjudication (H) only when two independently developed AI models (2M) disagree, preserving human oversight while affording the reproducibility and efficiency provided by AI [4]. The models’ independent development and performance characteristics ensure distinct “personalities,” similar to independent human readers.

Methods
Models 1 and 2 are previously developed and validated deep learning models. In addition to using distinct approaches, the models were trained by separate teams on different training data. Model 1 was trained using a self-supervised feature extractor, while model 2 was trained on supervised severity models consisting only of human-identifiable features. A panel of 10 blinded central readers (CRs) was selected, and readers were randomly assigned to provide independent human reads. A total of 150 lower endoscopy videos from routine practice, each with a confirmed diagnosis of UC, were independently assessed through the traditional 2 + 1 approach and the 2M+1H approach. The “final scores” generated by the two reading workflows were compared using quadratic weighted kappa (QWK). Noninferiority was concluded if the lower bound of the 95% confidence interval (CI), computed via bootstrapping, exceeded the margin-adjusted QWK threshold (margin set at -10%).

Results
The 2M+1H approach achieved good agreement with traditional 2 + 1 human central reading (QWK 0.78, 95% CI 0.69-0.84) and was statistically noninferior (Fig 1). Agreement on the binary classification of endoscopic improvement and endoscopic remission was very good (agreement rates 82.7% [95% CI 76.7%-88.7%] and 89.3% [95% CI 84.7%-94.0%], respectively).
Comparison of the two ML models showed behavior statistically similar to that of the first two human CRs in the 2 + 1 approach (QWK model 1 vs model 2: 0.74 [95% CI 0.66-0.80]; QWK CR 1 vs CR 2: 0.78 [95% CI 0.70-0.85]; p = 0.80). The 2M+1H approach reduced the total number of human reads from 349 (2 + 1) to 67 (2M+1H), an 81% decrease, lowering the number of human reads per video (0.45 vs 2.33, p < 0.001) (Fig 2).

Conclusion
Our data demonstrate that 2M+1H central reading achieved performance comparable to traditional 2 + 1 human central reading and is statistically noninferior while significantly reducing human workload. Prospective studies may adopt this approach to leverage ML as a regulatory-aligned alternative to the 2 + 1 human reading approach in UC clinical trials.

References
1. Food and Drug Administration. Ulcerative Colitis: Developing Drugs for Treatment [Internet]. 2022 [cited 2024 Oct 17]. Available from: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/ulcerative-colitis-developing-drugs-treatment
2. Gottlieb, K., Daperno, M., Usiskin, K., Sands, B. E., Ahmad, H., Howden, C. W., ... & Reinisch, W. (2021). Endoscopy and central reading in inflammatory bowel disease clinical trials: Achievements, challenges and future developments. Gut, 70(2), 418-426.
3. Rimondi, A., Gottlieb, K., Despott, E. J., Iacucci, M., Murino, A., & Tontini, G. E. (2023). Can artificial intelligence replace endoscopists when assessing mucosal healing in ulcerative colitis? A systematic review and diagnostic test accuracy meta-analysis. Digestive and Liver Disease, 56(7), 1164-1172.
4. Cohen, I. G., Babic, B., Gerke, S., Xia, Q., Evgeniou, T., & Wertenbroch, K. (2023). How AI can learn from the law: Putting humans in the loop only on appeal. npj Digital Medicine, 6(1), 160.
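
Illustrative sketch
The adjudication rule and the agreement analysis described in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the study's code, and it rests on several assumptions that the abstract does not state: scores are integer endoscopic subscores from 0 to 3, the adjudicator's read becomes the final score when the first two reads disagree, and the CI is a percentile bootstrap over videos. All function names are hypothetical, and the noninferiority threshold is left to the caller because the abstract does not give the reference value.

```python
import random
from typing import Callable, List, Sequence, Tuple


def final_score(read_a: int, read_b: int,
                adjudicate: Callable[[], int]) -> Tuple[int, int]:
    """Return (final score, adjudication reads used) for one video.

    Used for both workflows: in 2 + 1 the first two reads come from humans,
    in 2M+1H they come from the two models. Assumption: when the first two
    reads disagree, the adjudicator's read is taken as the final score.
    """
    if read_a == read_b:
        return read_a, 0
    return adjudicate(), 1


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             n_classes: int = 4) -> float:
    """Quadratic weighted kappa between two lists of integer scores."""
    n = len(a)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes))
              for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    if den == 0.0:
        # Both raters used one identical class throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - num / den


def bootstrap_ci(a: Sequence[int], b: Sequence[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for QWK, resampling videos with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(quadratic_weighted_kappa([a[i] for i in idx],
                                              [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def is_noninferior(ci_lower: float, threshold: float) -> bool:
    """Noninferiority holds if the CI lower bound exceeds the given threshold."""
    return ci_lower > threshold
```

In use, final_score would be applied per video under each workflow, the two resulting lists of final scores would be passed to quadratic_weighted_kappa and bootstrap_ci, and the adjudication counts returned by final_score would be summed to obtain the number of human adjudication reads.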