PURPOSE: To compare contours generated by a pre-trained, commercial AI model with those drawn manually in house. METHODS: Previously treated radiotherapy patients (N = 20 per site) with approved structure sets were selected across several sites (brain, head & neck, thorax, abdomen, pelvis) for this retrospective analysis. For the planning CT of each patient, a pre-trained AI model auto-contoured several organs at risk (OARs), including the bladder, brain, eyes, femurs, kidneys, lenses, mandible, and parotids. A two-step rejection method then filtered the results, removing (1) structures with unmatched names (i.e., contours present in only one of the two structure sets, which caused the comparison algorithm to auto-match structure names incorrectly) and (2) structures whose superior and/or inferior extents differed. For the remaining contours, the Dice similarity coefficient (DSC) and the 95th percentile of the Hausdorff distance (HD95%) were calculated between the auto-generated and manual contours using vendor-supplied software; median values were then calculated. RESULTS: The entire data set contained 592 structures at the outset of the analysis. After the rejection filters were applied, 294 structures remained; a large portion of the rejections was due to unmatched names. Of these, OARs with contours from at least 10 patients (N > 9) were analyzed further to include the 25th and 75th percentiles of DSC and HD95%. Results from this analysis (structures with N > 9) are presented in the table below, where results for left-right paired structures are combined into one row. The submandibular glands, larynx, and optic nerves (5 < N < 10) all had median DSC < 0.77, while the lungs had median DSC > 0.98 (N = 6). Parotids and lenses had poor DSC and HD95% scores and may require significant contour editing to achieve agreement with our clinical conventions.
This study highlights the difficulty of retrospectively comparing contours against an externally trained model, owing to variations in the superior/inferior extent of tubular structures such as the rectum, spinal cord, and esophagus. CONCLUSION: Results are encouraging, given that the pre-trained commercial model has not seen our institutional data. The pre-trained AI contouring model agreed closely with manual contours for large-volume, higher-contrast structures but agreed poorly for parotids and lenses. Variability in conventions regarding the superior/inferior extent of some structures hinders retrospective comparison with a pre-trained AI model.
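The two-step rejection described in the methods can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the filter actually used: the `Structure` record, the case-insensitive name matching, and the `tolerance_mm` threshold for superior/inferior extent are all hypothetical, since the abstract does not specify how names were matched or how large an extent difference triggered rejection.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

@dataclass(frozen=True)
class Structure:
    """Hypothetical record: a contour name and its inferior/superior z limits in mm."""
    name: str
    z_inferior_mm: float
    z_superior_mm: float

def filter_pairs(
    manual: Iterable[Structure],
    auto: Iterable[Structure],
    tolerance_mm: float = 3.0,  # assumed threshold; not given in the abstract
) -> Tuple[List[Tuple[Structure, Structure]], Dict[str, str]]:
    """Return (kept manual/auto pairs, {rejected name: reason})."""
    manual_by_name = {s.name.lower(): s for s in manual}
    auto_by_name = {s.name.lower(): s for s in auto}
    kept: List[Tuple[Structure, Structure]] = []
    rejected: Dict[str, str] = {}
    for name in sorted(set(manual_by_name) | set(auto_by_name)):
        m, a = manual_by_name.get(name), auto_by_name.get(name)
        if m is None or a is None:
            # Step 1: the contour exists in only one of the two structure sets.
            rejected[name] = "unmatched name"
        elif (abs(m.z_inferior_mm - a.z_inferior_mm) > tolerance_mm
              or abs(m.z_superior_mm - a.z_superior_mm) > tolerance_mm):
            # Step 2: superior and/or inferior extents differ beyond the tolerance.
            rejected[name] = "extent mismatch"
        else:
            kept.append((m, a))
    return kept, rejected
```

Recording a reason for each rejection makes it straightforward to report how much of the filtering is attributable to naming versus extent conventions.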