Ensemble voting decreases false positives in AI second-observer reads for detecting colorectal cancer.

Brandon Salinel, Corey Jensen, Sarah Zeien, John Chang, Jake Adkins, Vikram Kodibagkar, Hongzhi Wang, Phillip Koo, Tomislav Dragovich, Matthew Murphy, Tanveer Syeda-Mahmood, Michael A Choti, Madappa N Kundranda, Matthew Grudza, Curt Bay

doi:10.1200/jco.2022.40.4_suppl.141

Abstract

141 Background: Colorectal cancer (CRC) is the second leading cause of cancer-related deaths, and survival can be improved if early, suspicious imaging features on CT of the abdomen and pelvis (CTAP) can be routinely identified. At present, up to 40% of these features are undiagnosed on routine CTAP, but this can be improved with a second observer. In this study, we developed a deep ensemble learning method for detecting CRC on CTAP to determine whether increasing agreement between ensemble models can decrease the false positives detected by an artificial intelligence (AI) second observer. Methods: A 2D U-Net convolutional neural network (CNN) containing 31 million trainable parameters was trained with 58 CRC CT images from Banner MD Anderson (AZ) and MD Anderson Cancer Center (TX) (51 used for training and 7 for validation) and 59 normal CT scans from Banner MD Anderson Cancer Center. Twenty of the 25 CRC cases from public domain data (The Cancer Genome Atlas) were used to evaluate the performance of the models. The CRC was segmented using ITK-SNAP open-source software (v. 3.8). To apply the deep ensemble approach, five CNN models were trained independently with random initialization using the same U-Net architecture and the same training data. Given a testing CT scan, each of the five trained CNN models was applied to produce a tumor segmentation for that scan. The tumor segmentations produced by the trained CNN models were then fused using a simple majority voting rule to produce a consensus tumor segmentation. The segmentation was analyzed by the percentage of correct detections, the number of false positives per case, and the Dice similarity coefficient (DSC). A detection was considered correct if any part of the CRC was flagged by the AI. A detection was considered a false positive if the marked lesion did not overlap with any CRC; contiguous false positives across different slices of the CT image were considered a single false positive.
DSC measures the quality of the segmentation by measuring the overlap between the ground-truth lesion and the AI-detected lesion. Results: Our results showed that increasing the agreement required among the five models dramatically decreases the number of false positives per CT at the expense of a slight decrease in accuracy and DSC. This is described in the table. Conclusions: Our results show that an AI-based second observer can potentially detect CRC on routine CTAP. Although the initial result yields a high number of false positives per case, ensemble voting is an effective method for decreasing the false positives with a slight decrease in accuracy. This technique can be further improved for eventual clinical application. [Table: see text]
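
The fusion and scoring steps described in the Methods can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation. The function names (`vote`, `dice`, `components`, `evaluate`), the `min_agree` threshold parameter, the use of 6-connectivity to group voxels that are contiguous within and across slices, and the convention that two empty masks give a DSC of 1 are all assumptions made for the example. Inputs are assumed to be binary 3D arrays indexed as (slice, row, column).

```python
import numpy as np


def vote(masks, min_agree):
    """Keep voxels flagged by at least `min_agree` of the binary masks."""
    votes = np.sum(np.stack(masks).astype(np.int32), axis=0)
    return votes >= min_agree


def dice(pred, truth):
    """Dice similarity coefficient: 2|A and B| / (|A| + |B|)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect overlap
    return 2.0 * np.logical_and(pred, truth).sum() / total


def components(mask):
    """Yield each 6-connected component of a 3D mask as a list of voxels."""
    todo = set(map(tuple, np.argwhere(mask).tolist()))
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while todo:
        stack = [todo.pop()]
        comp = []
        while stack:
            z, y, x = stack.pop()
            comp.append((z, y, x))
            for dz, dy, dx in steps:
                neighbor = (z + dz, y + dy, x + dx)
                if neighbor in todo:
                    todo.remove(neighbor)
                    stack.append(neighbor)
        yield comp


def evaluate(pred, truth):
    """Return (detected, false_positive_count, dsc) for one case.

    A component that touches the ground truth marks the case as detected;
    a component with no overlap counts as one false positive, even if it
    spans several slices.
    """
    pred, truth = pred.astype(bool), truth.astype(bool)
    detected = False
    false_positives = 0
    for comp in components(pred):
        if any(truth[voxel] for voxel in comp):
            detected = True
        else:
            false_positives += 1
    return detected, false_positives, dice(pred, truth)
```

With five models, `vote(masks, 3)` corresponds to the simple majority rule, and raising `min_agree` to 4 or 5 demands stronger agreement before a voxel is kept. Calling `evaluate` on the fused mask at each threshold gives the kind of comparison the Results summarize: false positives per case against detection and DSC.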
