Abstract PO-074: The impact of phenotypic bias in the generalizability of deep learning models in non-small cell lung cancer

Aidan Gilson,Justin Du,Roy Herbst,Sachin Umrao,Guneet Janda,Sanjay Aneja,Marina Joel,Rachel Choi,Harlan Krumholz

doi:10.1158/1557-3265.adi21-po-074

Abstract

Although deep learning analysis of diagnostic imaging has shown increasing effectiveness in modeling non-small cell lung cancer (NSCLC) outcomes, only a minority of proposed deep learning algorithms have been externally validated. Because most of these models are built on single-institution datasets, their generalizability across the entire population remains understudied. Moreover, the effect of biases within institutional training datasets on the overall generalizability of deep learning prognostic models is unclear. We attempted to identify demographic and clinical characteristics that, if over-represented within training data, could affect the generalizability of deep learning models aimed at predicting survival in patients with NSCLC.

Using a dataset of pre-treatment CT images of 422 patients diagnosed with NSCLC, we examined deep learning model performance across demographic and tumor-specific factors. Demographic factors of interest included age and gender. Clinical factors of interest included tumor histology, overall stage, T stage, and N stage. The effect of bias among training data was examined by varying the representation of demographic and clinical populations within the training and validation datasets. Model generalizability was measured by comparing AUC values among validation datasets (biased versus unbiased). AUC was estimated using 1,000 bootstrapped samples of 400 patients from validation cohorts.

We found training datasets with biased representation of NSCLC histology to be associated with the greatest decrease in generalizability. Specifically, we found over-representation of adenocarcinoma within training datasets to be associated with an AUC reduction of 0.320 (CI 0.296-0.344, p<.001). Similarly, over-representation of squamous cell carcinoma was associated with an AUC reduction of 0.177 (CI 0.156-0.201, p<.001).
Biases in age (AUC reduction 0.103, p<0.001), T stage (0.170, p=0.01), and N stage (0.120, p=0.01) were also associated with reduced generalizability among deep learning models. Gender bias within training data was not associated with decreases in generalizability.

Deep learning models of NSCLC outcomes fail to generalize if trained on biased datasets. Specifically, over-representation of histologic subtypes may decrease the generalizability of deep learning models for NSCLC. Efforts to ensure that training data are representative of population demographics may lead to improved generalizability across more diverse patient populations.

Citation Format: Aidan Gilson, Justin Du, Guneet Janda, Sachin Umrao, Marina Joel, Rachel Choi, Roy Herbst, Harlan Krumholz, Sanjay Aneja. The impact of phenotypic bias in the generalizability of deep learning models in non-small cell lung cancer [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PO-074.
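The bootstrapped AUC comparison described in the abstract could be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the abstract does not state how survival was binarized, whether resampling was done with replacement, or how the confidence interval was derived. The sketch assumes a binary outcome label, sampling with replacement, and a percentile interval; the function names `auc_score` and `bootstrap_auc` and the synthetic data are hypothetical.

```python
import numpy as np


def auc_score(y_true, y_score):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def bootstrap_auc(y_true, y_score, n_boot=1000, sample_size=400, seed=0):
    """Mean AUC and a percentile interval over bootstrap resamples.

    Assumption: resampling with replacement and a 2.5th-97.5th percentile
    interval; the abstract does not specify either choice.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), size=sample_size)
        resampled = y_true[idx]
        # AUC is undefined when a resample contains only one class.
        if resampled.min() == resampled.max():
            continue
        aucs.append(auc_score(resampled, y_score[idx]))
    aucs = np.array(aucs)
    lower, upper = np.percentile(aucs, [2.5, 97.5])
    return float(aucs.mean()), float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic stand-in data: the same labels scored by a model that
    # separates the classes well and by one that separates them poorly.
    rng = np.random.default_rng(42)
    n_patients = 422
    labels = rng.integers(0, 2, size=n_patients)
    scores_unbiased = labels + rng.normal(0.0, 0.8, size=n_patients)
    scores_biased = labels + rng.normal(0.0, 2.0, size=n_patients)

    mean_u, lo_u, hi_u = bootstrap_auc(labels, scores_unbiased)
    mean_b, lo_b, hi_b = bootstrap_auc(labels, scores_biased)
    print(f"unbiased AUC: {mean_u:.3f} ({lo_u:.3f}-{hi_u:.3f})")
    print(f"biased AUC:   {mean_b:.3f} ({lo_b:.3f}-{hi_b:.3f})")
    print(f"AUC reduction: {mean_u - mean_b:.3f}")
```

The pairwise definition of AUC is used here to keep the sketch dependency-free and to handle tied scores explicitly; for larger cohorts a rank-based implementation would be more memory-efficient.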
