Abstract

Background

Intraoperative confocal laser endomicroscopy (CLE) is an in vivo imaging technique increasingly studied in neurosurgery and neuropathology. It can be affected by artifacts introduced by the CLE device or related to the intraoperative setting. We developed and evaluated an image annotation guideline (AGL) to detect and eliminate images that carry no valuable information as a result of such artifacts. Images were classified into good and bad quality based on defined technical criteria that clinical experts also consider relevant.

Material and Methods

Datasets were created from intraoperative CLE in vivo specimens of patients who underwent resection of brain tumors. The process from data collection to development of the ML algorithm followed seven steps: data quality specification, image and metadata collection, AGL development, annotation, data allocation for clinical validation, clinical validation, and, optionally, algorithm development. Final diagnoses were obtained by pathological analysis. Artifacts were grouped into three categories: diminished signal-to-noise ratio (dSNR), optical distortions (movement/perturbations), and contrast/brightness artifacts. Images were annotated by four medical data annotators (T4). For clinical validation, 500 images were excluded from the training data and additionally annotated by three board-certified neuropathologists (NPs 1-3) with experience in CLE imaging to determine the medical consensus on good and bad images. All raters (NPs) were compared against each other and against T4; T4 was also compared against the medical consensus. Cohen's Kappa and overall percent agreement (OPA) were used to evaluate inter-rater reliability. Positive percent agreement (PPA) and negative percent agreement (NPA) were also used to evaluate agreement between the medical consensus and T4.

Results

21,616 CLE images and corresponding clinical metadata were collected from 94 patients and annotated. For each case, between 27 and 815 CLE images were acquired over the course of the surgery (mean = 175 images per case, SD = 170.6). 11% and 13% of images were labeled as dSNR and distortion, respectively, and 34% as contrast/brightness artifacts; the remaining 42% of images were of good quality. Inter-rater agreement between the three NPs ranged from 0.30 to 0.59. Agreement between T4 and the medical consensus was substantial (Cohen's Kappa >= 0.61). OPA between T4 and the medical consensus was 80.60%, PPA 72.34%, and NPA 87.92%.

Conclusion

Annotations produced according to a well-structured, expertly curated AGL show higher Cohen's Kappa and OPA with the medical consensus than individual experts achieve among one another. Such an AGL can therefore be considered appropriate: it produces results on par with annotations by a group of experts in the field and can be further employed for training machine learning (ML) algorithms.
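For illustration, the sketch below shows one way to compute the agreement statistics used in the clinical validation (Cohen's Kappa, OPA, PPA, and NPA) for binary good/bad quality labels. The majority-vote consensus rule and the convention that "good" counts as the positive class are assumptions made for this sketch; the abstract does not specify how the medical consensus was derived.

    # Minimal Python sketch (assumed implementation, not from the paper):
    # agreement statistics for binary image-quality labels, 1 = good, 0 = bad.
    from collections import Counter

    def cohens_kappa(a, b):
        # Cohen's Kappa: chance-corrected agreement between two raters.
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n  # this is also the OPA
        ca, cb = Counter(a), Counter(b)
        expected = sum((ca[c] / n) * (cb[c] / n) for c in (0, 1))
        return (observed - expected) / (1 - expected)

    def agreement_stats(consensus, rater):
        # OPA, PPA, and NPA of a rater against the reference (consensus) labels.
        n = len(consensus)
        opa = sum(c == r for c, r in zip(consensus, rater)) / n
        pos = [r for c, r in zip(consensus, rater) if c == 1]
        neg = [r for c, r in zip(consensus, rater) if c == 0]
        ppa = sum(r == 1 for r in pos) / len(pos)  # agreement on "good" images
        npa = sum(r == 0 for r in neg) / len(neg)  # agreement on "bad" images
        return opa, ppa, npa

    def majority_consensus(np1, np2, np3):
        # Assumed rule: an image is "good" if at least 2 of 3 NPs label it good.
        return [1 if a + b + c >= 2 else 0 for a, b, c in zip(np1, np2, np3)]

    # Toy usage with hypothetical labels for five images:
    np1, np2, np3 = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0], [1, 1, 1, 1, 0]
    t4 = [1, 0, 1, 0, 0]
    ref = majority_consensus(np1, np2, np3)   # -> [1, 0, 1, 1, 0]
    print(cohens_kappa(ref, t4))              # Kappa of T4 vs. consensus
    print(agreement_stats(ref, t4))           # (OPA, PPA, NPA)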

