PD01-11 AUTOMATED SURGICAL SKILLS ASSESSMENT: CONSENSUS BUILDING TOWARDS HUMAN-DERIVED GROUND TRUTH SCORES FOR MACHINE LEARNING ALGORITHMS

Timothy N Chu, Elyssa Y Wong, Cherine H Yang, Mitchell G Goldenberg, Andrew J Hung, Daniel I Sanford, Istabraq S Dalieh, Runzhuo Ma

doi:10.1097/ju.0000000000003218.11

Abstract

Journal of Urology, Volume 209, Issue Supplement 4, April 2023, Page e67 (published 1 Apr 2023)

INTRODUCTION AND OBJECTIVE: Artificial Intelligence (AI)-based assessments of surgical skills may provide objective evaluation of surgeon skill in a scalable manner. However, human scoring used as the ground truth reference standard in AI development may introduce subjectivity and error. Here, we aim to assess the value of a previously validated four-round consensus building process to limit rater bias prior to training machine learning (ML) algorithms with human-derived labels.

METHODS: Three different datasets derived from VR suturing exercises completed on the Mimic™ Flex VR robotic simulator were included. Deidentified participant videos were assigned binary technical scores for various suturing skills (needle positioning, entry angle, needle driving, needle withdrawal) using a validated assessment tool. Different combinations of three blinded and independent human raters achieved score consensus after undergoing a standardized four-round consensus building process (Figure 1a). The proportion of minority scores (e.g., one rater disagreeing with the other two) remaining after each round was tracked to determine trends in the consensus building process.

RESULTS: In total, 5634 suturing skill assessments across all exercises were included. After initial video review (Round 1), 32% of assessments had a reviewer who provided a minority score. Following an individual review of minority scores (Round 2), 800/1816 (44.1%) of the minority scores persisted for group review across the three different exercises. Of the minority scores that persisted to Round 3 (group review), 274/800 (34.3%) became the final ground truth score. Out of all original minority scores, 274/1816 (15.1%) persisted through the entire consensus building process and became the ground truth labels (Figure 1b).

CONCLUSIONS: When using multiple human raters, a considerable proportion of initial minority scores were ultimately agreed on as the ground truth skill assessment. This finding was consistent across multiple datasets, domains, and various raters. The standardized consensus-building process is an essential step in the creation of accurate ML models and underscores the significant impact human rater bias has on evaluations of surgical performance.

Source of Funding: Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA251579. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

© 2023 by American Urological Association Education and Research, Inc.
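As an illustration of the bookkeeping described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It shows one way to flag a minority score among three binary ratings and to recompute the reported proportions from the counts given in the results. The function names `minority_rater` and `pct` are hypothetical, and the half-up rounding is an assumption chosen because it reproduces the published percentages.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP


def minority_rater(scores):
    """Return the index of the rater who disagrees with the other two.

    `scores` holds three binary ratings (0 or 1) for one assessment.
    Returns None when all three raters agree.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three ratings")
    counts = Counter(scores)
    if len(counts) == 1:
        return None  # unanimous, so there is no minority score
    minority_value = min(counts, key=counts.get)
    return scores.index(minority_value)


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded half-up to one decimal."""
    value = Decimal(100 * part) / Decimal(whole)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Counts as reported in the abstract.
TOTAL_ASSESSMENTS = 5634
INITIAL_MINORITY = 1816      # minority scores after Round 1
PERSISTED_TO_GROUP = 800     # still in the minority after Round 2
BECAME_GROUND_TRUTH = 274    # adopted as the final label

round1_rate = pct(INITIAL_MINORITY, TOTAL_ASSESSMENTS)
round2_rate = pct(PERSISTED_TO_GROUP, INITIAL_MINORITY)
round3_rate = pct(BECAME_GROUND_TRUTH, PERSISTED_TO_GROUP)
overall_rate = pct(BECAME_GROUND_TRUTH, INITIAL_MINORITY)
```

Under these assumptions, `round2_rate`, `round3_rate`, and `overall_rate` match the 44.1%, 34.3%, and 15.1% reported above, and `round1_rate` (about 32.2%) is consistent with the stated 32% if 1816 is the number of Round 1 assessments carrying a minority score.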
