Abstract

Auto-segmentation methods, including deep learning auto-segmentation (DLAS), are increasingly being adopted in radiotherapy. A common way to evaluate the quality of auto-segmented contours is to apply thresholds to quantitative metrics (e.g., Dice similarity coefficient (DSC), mean distance to agreement (MDA)) that are often averaged over all contour slices. This approach fails to detect contour errors on individual slices and therefore reflects neither current clinical practice (slice-by-slice evaluation) nor clinical usability (e.g., expected contour editing time). In addition, results from multiple metrics are generally difficult to interpret. This work aims to develop a novel contour quality classification (CQC) model that evaluates auto-segmented contours based on their clinical applicability. The CQC method was designed to classify the contour on each slice as acceptable, minor edit, or major edit, based on the expected editing effort/time. Organ-specific supervised ensemble tree classification models were trained to relate the slice-based quality category to a combination of seven commonly used, calculable quantitative metrics (i.e., DSC, MDA, Hausdorff 95% distance, surface DSC, added path length (APL), slice area, and relative APL). The proposed method was demonstrated by training CQC models on DLAS contours of five abdominal organs (i.e., pancreas, duodenum, stomach, and small and large bowels) from 50 MRI sets and evaluating them on 20 MRI and 9 CT testing sets. The test datasets were labelled by six individual observers, and consensus labels were generated by majority vote. Model performance was evaluated using accuracy and risk rate (RR, the percentage of unacceptable slices mislabeled as acceptable) and was compared with inter-observer variation and with a baseline threshold-based method.
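The limitation of slice-averaged metrics described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2x2 masks, the function names, and the DSC threshold of 0.8 are all assumptions chosen for illustration only.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient of two binary masks (nested lists of 0/1)."""
    flat_a = [v for row in mask_a for v in row]
    flat_b = [v for row in mask_b for v in row]
    overlap = sum(1 for a, b in zip(flat_a, flat_b) if a and b)
    total = sum(flat_a) + sum(flat_b)
    # Two empty masks are treated as perfect agreement (an assumption).
    return 1.0 if total == 0 else 2.0 * overlap / total


# Toy three-slice contour: two slices agree perfectly, one is poor.
good = [[1, 1], [1, 1]]
auto_slices = [good, good, [[1, 1], [0, 0]]]
ref_slices = [good, good, [[0, 1], [0, 1]]]

per_slice = [dice(a, r) for a, r in zip(auto_slices, ref_slices)]
mean_dsc = sum(per_slice) / len(per_slice)

THRESHOLD = 0.8  # hypothetical acceptance threshold, not taken from the source
volume_passes = mean_dsc >= THRESHOLD
failing_slices = [i for i, d in enumerate(per_slice) if d < THRESHOLD]
```

In this toy case the averaged DSC passes the hypothetical threshold even though the third slice, taken alone, does not, which is the situation a slice-based evaluation is meant to catch.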
Compared with the majority vote labels, the CQC models achieved a mean accuracy of 95.8% ([94.5%-99.1%]) and 94.3% ([90.6%-96.9%]), and a mean RR of 0.8% ([0.3%-1.3%]) and 0.7% ([0%-1.1%]), for the MRI and CT testing sets, respectively. The CQC performance was comparable to the inter-observer variation and significantly better than that of the threshold-based method with single or multiple metrics. Execution on a typical abdominal dataset (e.g., 70 slices) took less than 3 seconds.

Table 1. CQC model performance for different organs

CONCLUSION: The proposed CQC model can classify the quality of a contour slice with high accuracy. This slice-based, single-output evaluation method better reflects current clinical practice and may be used to evaluate and compare the performance of DLAS on any image modality, facilitating its clinical implementation and quality assurance.
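The evaluation quantities reported above (majority vote consensus, accuracy, and RR) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and example labels are hypothetical, RR is computed here with the number of unacceptable consensus slices as the denominator (one reading of the definition given in the abstract), and the abstract does not say how tied votes were resolved, so the example avoids ties.

```python
from collections import Counter

ACCEPTABLE, MINOR, MAJOR = "acceptable", "minor edit", "major edit"


def majority_vote(observer_labels):
    """Most frequent label among observers (tie handling is unspecified in the source)."""
    return Counter(observer_labels).most_common(1)[0][0]


def accuracy(predicted, consensus):
    """Fraction of slices whose predicted category matches the consensus label."""
    matches = sum(1 for p, c in zip(predicted, consensus) if p == c)
    return matches / len(consensus)


def risk_rate(predicted, consensus):
    """Fraction of unacceptable slices (minor or major edit) predicted as acceptable.

    Assumption: the denominator is the number of unacceptable consensus slices.
    """
    unacceptable = [(p, c) for p, c in zip(predicted, consensus) if c != ACCEPTABLE]
    if not unacceptable:
        return 0.0
    risky = sum(1 for p, _ in unacceptable if p == ACCEPTABLE)
    return risky / len(unacceptable)


# Hypothetical labels from six observers for four slices.
votes = [
    [ACCEPTABLE] * 5 + [MINOR],
    [MINOR] * 4 + [ACCEPTABLE] * 2,
    [MAJOR] * 4 + [MINOR] * 2,
    [MINOR] * 3 + [MAJOR] * 2 + [ACCEPTABLE],
]
consensus = [majority_vote(v) for v in votes]
predicted = [ACCEPTABLE, MINOR, MAJOR, ACCEPTABLE]  # hypothetical model output

acc = accuracy(predicted, consensus)
rr = risk_rate(predicted, consensus)
```

Keeping RR separate from accuracy makes the clinically risky error (an unacceptable slice passed as acceptable) visible on its own rather than folded into the overall error count.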
