Background
A challenge in the development of artificial intelligence (AI) is the lack of multi-expert observer datasets large enough to train deep learning models; this is particularly true for head and neck (H&N) cases, which have high interobserver segmentation variability. We therefore created Contouring Collaborative for Consensus in Radiation Oncology (C3RO), a crowdsourced challenge engaging international radiation oncologists in cloud-based contouring, to evaluate whether collective contours generated by large numbers of non-experts could meet or exceed expert interobserver agreement, the current "gold standard," in the segmentation of an H&N case.

Methods
Participants who contoured at least one region of interest (ROI) for the C3RO H&N challenge were categorized as generalists, self-identified specialists, or recognized experts. Cohort-specific ROIs were combined into single simultaneous truth and performance level estimation (STAPLE) consensus segmentations. STAPLEgeneralist and STAPLEspecialist ROIs were each evaluated against the STAPLEexpert contours using the Dice Similarity Coefficient (DSC). The expert interobserver DSC (IODSCexpert) served as the performance acceptability threshold for the comparisons of STAPLEgeneralist or STAPLEspecialist versus STAPLEexpert. To determine the number of generalists required to match the IODSCexpert for each ROI, a single STAPLEbootstrap consensus contour was generated for each of 10 random bootstrap folds using a variable number of generalists (between 2 and 25) and then compared against the IODSCexpert.

Results
The H&N challenge yielded contours from 58 generalists, 8 self-identified specialists, and 15 experts. The DSCs for both STAPLEgeneralist and STAPLEspecialist versus STAPLEexpert were higher than the IODSCexpert for most ROIs, including the right parotid (STAPLEgeneralist/STAPLEspecialist/IODSCexpert: 0.95/0.94/0.87), left parotid (0.94/0.91/0.86), constrictor muscles (0.8/0.6/0.58), larynx (0.9/0.91/0.67), primary gross tumor volume (GTVp) (0.83/0.86/0.78), right submandibular gland (0.83/0.89/0.78), left submandibular gland (0.93/0.94/0.85), and clinical target volume 2 (CTV2) (0.84/0.82/0.7). The DSCs for both STAPLEgeneralist and STAPLEspecialist were lower than the IODSCexpert for CTV1 (0.68/0.65/0.85). Interestingly, for the nodal GTV (GTVn), the DSC for STAPLEgeneralist (0.96) was higher than the IODSCexpert (0.9), whereas the DSC for STAPLEspecialist (0.8) was lower. For the brainstem, the DSC for STAPLEspecialist (0.86) exceeded the IODSCexpert (0.82), whereas the DSC for STAPLEgeneralist (0.8) did not. The theoretical minimum number of generalist segmentations needed to cross the IODSCexpert acceptability threshold ranged between 2 and 5 for all H&N ROIs.

Discussion
These results show that five or more generalists could potentially create consensus ROIs with performance approximating that of an individual expert, providing a feasible mechanism to improve AI algorithm development for H&N.
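For readers interested in how the core comparison described in the Methods might be scripted, the sketch below computes a Dice Similarity Coefficient between binary masks and bootstraps progressively larger generalist subsets against an expert reference. It is a minimal illustration, not the C3RO implementation: it uses plain NumPy, substitutes a simple voxel-wise majority vote for the STAPLE algorithm, and the mask shapes, observer counts, and function names are illustrative assumptions.

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def majority_vote(masks: list) -> np.ndarray:
    """Simple consensus stand-in for STAPLE: voxel-wise majority vote."""
    return (np.mean(np.stack(masks), axis=0) >= 0.5).astype(np.uint8)

def bootstrap_dsc(generalist_masks, expert_reference, n_observers, n_folds=10, seed=0):
    """Mean DSC of consensus contours built from random subsets of generalists."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_folds):
        idx = rng.choice(len(generalist_masks), size=n_observers, replace=False)
        consensus = majority_vote([generalist_masks[i] for i in idx])
        scores.append(dice(consensus, expert_reference))
    return float(np.mean(scores))

# Illustrative usage with synthetic 3D masks (shapes and counts are arbitrary):
shape = (64, 64, 32)
rng = np.random.default_rng(42)
generalists = [(rng.random(shape) > 0.6).astype(np.uint8) for _ in range(25)]
expert = majority_vote(generalists[:5])  # placeholder "expert" reference, not real data
for n in range(2, 6):
    print(n, round(bootstrap_dsc(generalists, expert, n_observers=n), 3))
```

In the study itself, the consensus step used STAPLE rather than majority voting, and the mean bootstrap DSC for each subset size would be compared against the IODSCexpert threshold to find the minimum number of generalists needed per ROI.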