Abstract
Background: A challenge in the development of artificial intelligence (AI) is the lack of multi-expert observer datasets large enough to train deep learning models; this is particularly true for head and neck (H&N) cases, which have high interobserver segmentation variability. As such, we created Contouring Collaborative for Consensus in Radiation Oncology (C3RO), a crowdsourced challenge engaging international radiation oncologists in cloud-based contouring, to evaluate whether collective contours generated from large numbers of non-experts could meet or exceed expert interobserver agreement, the current "gold standard," in the segmentation of an H&N case.

Methods: Participants who contoured at least one region of interest (ROI) for the C3RO H&N challenge were categorized as generalists, self-identified specialists, or recognized experts. Cohort-specific ROIs were combined into single consensus segmentations using simultaneous truth and performance level estimation (STAPLE). STAPLE_generalist and STAPLE_specialist ROIs were evaluated against STAPLE_expert contours using the Dice similarity coefficient (DSC). The expert interobserver DSC (IODSC_expert) was calculated as a performance acceptability threshold for STAPLE_generalist or STAPLE_specialist versus STAPLE_expert. To determine the number of generalists required to match the IODSC_expert for each ROI, a single STAPLE_bootstrap consensus contour was generated for each of 10 random bootstrap folds using a variable number of generalists (between 2 and 25) and then compared to the IODSC_expert.

Results: The H&N challenge yielded contours from 58 generalists, 8 self-identified specialists, and 15 experts. The DSCs for STAPLE_generalist and STAPLE_specialist versus STAPLE_expert were both higher than the respective IODSC_expert for most ROIs, including the right parotid (STAPLE_generalist/STAPLE_specialist/IODSC_expert: 0.95/0.94/0.87), left parotid (0.94/0.91/0.86), muscle constrictors (0.8/0.6/0.58), larynx (0.9/0.91/0.67), primary gross tumor volume (GTVp) (0.83/0.86/0.78), right submandibular gland (0.83/0.89/0.78), left submandibular gland (0.93/0.94/0.85), and clinical target volume 2 (CTV2) (0.84/0.82/0.7). The DSCs for both STAPLE_generalist and STAPLE_specialist were lower than the respective IODSC_expert for CTV1 (0.68/0.65/0.85). Interestingly, for the nodal GTV (GTVn), the DSC for STAPLE_generalist was higher (0.96) and the DSC for STAPLE_specialist was lower (0.8) than the IODSC_expert (0.9). For the brainstem, the DSC for STAPLE_specialist (0.86) exceeded the IODSC_expert (0.82), whereas the DSC for STAPLE_generalist (0.8) did not. The theoretical minimum number of generalist segmentations needed to cross the IODSC_expert acceptability threshold ranged from 2 to 5 for all H&N ROIs.

Discussion: These results show that 5 or more generalists could potentially create consensus ROIs with performance approximating that of an individual expert, providing a feasible mechanism to improve AI algorithm development for H&N.
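To make the consensus-and-comparison workflow in the Methods concrete, the following is a minimal sketch, not the authors' actual pipeline, of how one ROI's cohort contours could be fused with STAPLE and scored against the expert consensus with the DSC. It assumes each rater's ROI is exported as a binary NIfTI mask and uses the STAPLE filter from the SimpleITK library; all file names below are hypothetical.

import SimpleITK as sitk
import numpy as np

def staple_consensus(mask_paths, threshold=0.5):
    # Fuse one cohort's binary masks into a single STAPLE consensus contour.
    masks = [sitk.ReadImage(p, sitk.sitkUInt8) for p in mask_paths]
    prob = sitk.STAPLE(masks, 1.0)  # per-voxel consensus probability map
    # Binarize the probability map (args: lower, upper, inside, outside values).
    return sitk.BinaryThreshold(prob, threshold, 1.0, 1, 0)

def dice(a, b):
    # Dice similarity coefficient between two binary SimpleITK masks.
    a_arr = sitk.GetArrayFromImage(a) > 0
    b_arr = sitk.GetArrayFromImage(b) > 0
    denom = a_arr.sum() + b_arr.sum()
    return 2.0 * np.logical_and(a_arr, b_arr).sum() / denom if denom else 1.0

# Hypothetical mask files for one ROI (e.g. right parotid), one per rater.
generalist_consensus = staple_consensus(["gen_01.nii.gz", "gen_02.nii.gz"])
expert_consensus = staple_consensus(["exp_01.nii.gz", "exp_02.nii.gz"])
print("DSC, generalist consensus vs expert consensus:",
      dice(generalist_consensus, expert_consensus))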
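The bootstrap analysis of how many generalists are needed to reach the expert acceptability threshold could be sketched in the same spirit, reusing staple_consensus() and dice() from the sketch above. The sampling scheme (here, raters drawn without replacement in each of 10 folds) and the example threshold value are illustrative assumptions rather than the authors' exact procedure.

import random

def min_generalists_to_threshold(generalist_paths, expert_consensus,
                                 iodsc_expert, n_folds=10, k_range=range(2, 26)):
    # For each candidate cohort size k, build 10 STAPLE_bootstrap consensus
    # contours from randomly sampled generalists and check whether their mean
    # DSC against the expert consensus reaches the expert interobserver threshold.
    for k in k_range:
        scores = []
        for _ in range(n_folds):
            sample = random.sample(generalist_paths, k)  # one random fold of k raters
            bootstrap_consensus = staple_consensus(sample)
            scores.append(dice(bootstrap_consensus, expert_consensus))
        if sum(scores) / len(scores) >= iodsc_expert:
            return k  # smallest cohort size crossing the acceptability threshold
    return None  # threshold not reached within the tested range

# Hypothetical usage for a single ROI:
# k_min = min_generalists_to_threshold(generalist_paths, expert_consensus, iodsc_expert=0.87)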