Abstract

<h3>Purpose/Objective(s)</h3>

Contouring Collaborative for Consensus in Radiation Oncology (C3RO) is a public crowdsourced challenge engaging radiation oncologists across various expertise levels in cloud-based image segmentation. A persistent challenge in artificial intelligence (AI) development is the relative paucity of multi-expert observer datasets large enough to train deep learning models; consequently, we sought to characterize whether aggregate segmentations generated from large numbers of generalists could meet or exceed expert interobserver agreement, the current "gold standard."

<h3>Materials/Methods</h3>

Participants who contoured at least one region of interest (ROI) for the C3RO breast or sarcoma challenge were identified as generalists or recognized experts. Cohort-specific ROIs were combined into single simultaneous truth and performance level estimation (STAPLE) consensus segmentations. STAPLE<sub>generalist</sub> ROIs were evaluated against STAPLE<sub>expert</sub> contours using the Dice similarity coefficient (DSC). The expert interobserver DSC (IODSC<sub>expert</sub>), defined as the median pairwise DSC across experts, served as the performance acceptability threshold for comparing STAPLE<sub>generalist</sub> with STAPLE<sub>expert</sub>. To determine the number of generalists required to match the IODSC<sub>expert</sub> for each ROI, a single STAPLE<sub>bootstrap</sub> consensus contour was generated for each of 10 random bootstrap folds using a variable number of generalists (2 to 25) and then compared to the IODSC<sub>expert</sub>.

<h3>Results</h3>

The breast challenge yielded contours from 124 generalists and 8 experts. The DSC between STAPLE<sub>generalist</sub> and STAPLE<sub>expert</sub> was higher than the respective IODSC<sub>expert</sub> for all ROIs, including the axilla (STAPLE<sub>generalist</sub> DSC/IODSC<sub>expert</sub>, 0.86/0.68), chest wall (0.91/0.67), heart (0.97/0.90), supraclavicular nodes (0.77/0.57), internal mammary nodes (0.66/0.46), left brachial plexus (0.46/0.20), and left anterior descending artery (0.62/0.32). The sarcoma challenge yielded contours from 61 generalists and 4 experts. The DSC between STAPLE<sub>generalist</sub> and STAPLE<sub>expert</sub> was higher than the respective IODSC<sub>expert</sub> for the gross tumor volume (GTV; 0.97/0.94) and clinical target volume (CTV; 0.76/0.69), but not for the genitalia (0.60/0.66). The theoretical minimum number of generalist segmentations needed to cross the IODSC<sub>expert</sub> acceptability threshold ranged from 2 to 4 for breast ROIs and from 2 to 5 for sarcoma ROIs.

<h3>Conclusion</h3>

Multi-generalist-generated consensus ROIs met or exceeded expert-derived acceptability thresholds. These analyses suggest that five or more generalists could generate consensus ROIs with DSC performance approximating that of an individual expert, supporting multi-generalist segmentation as a feasible input for AI development. Future research will explore whether these observations are site-specific and/or generalizable to more granular surface metrics.
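For readers who want to reproduce this style of analysis, the sketch below illustrates the core computations described in Materials/Methods: the DSC, the expert interobserver threshold (median pairwise DSC), and the bootstrap search for the minimum number of generalists. This is a minimal illustration, not the study's code. In particular, majority voting is used here as a simplified stand-in for the EM-based STAPLE algorithm (a dedicated STAPLE implementation, such as the one in ITK/SimpleITK, would be used in practice), and all function names and the binary-mask representation are assumptions.

```python
from itertools import combinations
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient: 2*|A & B| / (|A| + |B|) for binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return float(2.0 * np.logical_and(a, b).sum() / denom) if denom else 1.0

def expert_io_dsc(expert_masks):
    """Expert interobserver DSC: median of DSCs over all pairs of experts."""
    return float(np.median([dice(a, b) for a, b in combinations(expert_masks, 2)]))

def consensus(masks):
    """Majority-vote consensus mask -- a simplified stand-in for STAPLE,
    which instead weights each rater by an EM estimate of their performance."""
    return np.mean([m.astype(float) for m in masks], axis=0) >= 0.5

def min_generalists(generalist_masks, expert_consensus, threshold,
                    n_folds=10, max_n=25, seed=0):
    """Smallest number of generalists whose bootstrapped consensus meets the
    IODSC_expert threshold (10 random subsets per candidate size, as in the abstract)."""
    rng = np.random.default_rng(seed)
    for n in range(2, max_n + 1):
        fold_scores = []
        for _ in range(n_folds):
            # Draw n distinct generalists, build their consensus, score it
            # against the expert consensus for this ROI.
            idx = rng.choice(len(generalist_masks), size=n, replace=False)
            subset = [generalist_masks[i] for i in idx]
            fold_scores.append(dice(consensus(subset), expert_consensus))
        if np.median(fold_scores) >= threshold:
            return n
    return None  # threshold never reached within max_n generalists
```

With `generalist_masks` and `expert_masks` as lists of same-shape binary arrays for one ROI, `min_generalists(generalist_masks, consensus(expert_masks), expert_io_dsc(expert_masks))` returns the minimum generalist count for that ROI, analogous to the 2-5 range reported in the Results.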
