<h3>Purpose/Objective(s)</h3>
Automated image segmentation algorithms are being introduced with limited clinical validation. The benefit of these algorithms is negated when expert review and editing become more laborious than manual segmentation. This work introduces a systematic method for acceptance of automated segmentation that accounts for inter-organ and inter-patient variations.
<h3>Materials/Methods</h3>
Clinical validation of two artificial intelligence algorithms (A1, A2) for computed tomography (CT) was performed. The dataset comprised 80 radiation therapy patients (P1-P80) with contours of 34 different organs at risk (OARs), spanning the head and neck, thorax, abdomen, and pelvis regions and an array of CT scanning protocols. Nine segmentation comparison metrics were used, including volume metrics, surface distance metrics, and statistical measures. The evaluation metrics were compared inter-OAR (averaged across all patients) and inter-patient (averaged across all OARs). This evaluation has the potential to identify algorithm problems for specific OARs and for specific patients.
<h3>Results</h3>
Both algorithms were successful for many OARs; the inter-organ analysis identified the bladder, kidneys, liver, mandible, spinal canal, and stomach as having above-average statistics across the population. For other organs, including the great vessels, heart, larynx, lenses, optic nerves, and optic chiasm, both algorithms consistently performed below average in most metrics. In many of these cases, volume metrics were < 0.5 and surface distance metrics were > 15 mm. These OARs can indicate algorithm problems or institutional variations between human and AI contours; e.g., for the great vessels, human contours in lung plans include only the relevant region rather than the entire aorta as contoured by A1. Similarly, A1 delineated only the femoral head, whereas the institutional femur contour extends inferiorly to the gluteal tuberosity. Based on the per-organ statistics, we removed several contours from consideration, including the heart, esophagus, larynx, and pharyngeal constrictors in A2. Inter-patient comparisons identified four inconsistent patients. P2 failed 5/9 metrics; investigation revealed that the contours were generated on CTs from different time points. P8 failed 5/9 metrics due to A2 failures in the bladder and rectum; A1 had no issues for this patient. P13 failed just one metric; investigation found the lens, optic chiasm, and left optic nerve in the wrong anatomic region. P53 failed one metric: A1 failed for the submandibular gland due to severe dental artifacts.
<h3>Conclusion</h3>
Clinical validation of automated contouring should be performed and compared against institutional data. The proposed process of per-organ and per-patient averages identified legitimate algorithm problems in specific structures as well as variations from institutional norms. By identifying and removing problematic structures, AI-based segmentation may be more readily accepted clinically.
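The abstract does not enumerate the nine comparison metrics, so the sketch below is only illustrative of the per-OAR and per-patient averaging described in Materials/Methods. It assumes, hypothetically, the Dice coefficient as a volume metric and the 95th-percentile symmetric surface distance as a surface metric, computed on binary masks over a common isotropic voxel grid; all function names and the spacing_mm parameter are illustrative, not taken from the study.
<pre><code>
import numpy as np
from scipy import ndimage


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks (a volume metric)."""
    intersection = np.logical_and(a, b).sum()
    denominator = a.sum() + b.sum()
    return 2.0 * intersection / denominator if denominator else 1.0


def surface_distance_95(a: np.ndarray, b: np.ndarray, spacing_mm: float = 1.0) -> float:
    """Symmetric 95th-percentile surface distance in mm (a surface distance metric)."""
    def surface(mask):
        # Surface voxels are mask voxels removed by a one-voxel erosion.
        return np.logical_and(mask, ~ndimage.binary_erosion(mask))

    sa, sb = surface(a), surface(b)
    if not sa.any() or not sb.any():
        return float("inf")  # one structure is empty; no meaningful surface distance
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_b = ndimage.distance_transform_edt(~sb) * spacing_mm
    dist_to_a = ndimage.distance_transform_edt(~sa) * spacing_mm
    distances = np.concatenate([dist_to_b[sa], dist_to_a[sb]])
    return float(np.percentile(distances, 95))


def aggregate(scores: dict) -> tuple:
    """Average scores[(patient, oar)] per OAR (across patients) and per patient (across OARs)."""
    per_oar, per_patient = {}, {}
    for (patient, oar), value in scores.items():
        per_oar.setdefault(oar, []).append(value)
        per_patient.setdefault(patient, []).append(value)
    return (
        {oar: float(np.mean(v)) for oar, v in per_oar.items()},
        {patient: float(np.mean(v)) for patient, v in per_patient.items()},
    )


def flag_below_average(means: dict) -> list:
    """Return OARs (or patients) whose mean metric falls below the population mean
    (appropriate for overlap metrics; invert the comparison for distance metrics)."""
    population_mean = float(np.mean(list(means.values())))
    return [name for name, value in means.items() if value < population_mean]
</code></pre>
Per-OAR means flag structures, and per-patient means flag patients, that deviate from the population average, mirroring how the study surfaced structures such as the great vessels and patients P2, P8, P13, and P53; absolute screens (e.g., volume overlap < 0.5 or surface distance > 15 mm) could be applied in the same pass.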