Abstract
Background: Cancer disparities exist within the Hispanic/Latino/a/x (H/L) population by nativity status. Non-US born H/L colorectal cancer (CRC) patients and prostate cancer (PCa) patients in California have a statistically significantly lower risk of cancer-specific survival when compared to Non-Hispanic White patients after controlling for confounders. However, birthplace information is largely incomplete in population-based cancer registries. For example, more than 40% of H/L patients have missing information on nativity in the California Cancer Registry (CCR) database. With the goal of better imputing missing birthplace, we built a machine learning (ML) model to predict nativity among CCR H/L cancer patients.
Methods: We analyzed H/L cases aged at least 18 years with primary invasive CRC (n = 75,613) or PCa (n = 93,042), diagnosed between 1988 and 2021 and contained in the CCR research file (46.36% had missing birthplace). A binary indicator variable assigning cases with known birthplace to US-born vs. Non-US born was constructed as the target for prediction. A split stratified by levels of the outcome was used to generate training (70%, n = 63,323) and testing (30%, n = 27,139) datasets. We used a pruned classification tree to perform feature selection, retaining the 10 features with the highest importance (measured by the overall reduction in impurity across all splits on each feature) to predict nativity. Missing observations of the quantitative and qualitative features were mean- or mode-imputed, respectively. The classification methods considered were classification trees, Random Forest (RF), and Boosting. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and the rate of correctly classified observations.
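The data-preparation and evaluation steps named in the Methods can be illustrated with a minimal standard-library Python sketch. It is not the authors' code: the function names (`stratified_split`, `impute`, `evaluate`), the 0.5 decision threshold, and the choice of which nativity class is coded as 1 are all assumptions made for illustration, and the classifiers themselves (tree, RF, Boosting) are not reproduced.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices into train/test, sampling train_frac within each outcome level."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


def impute(column):
    """Fill None with the mean (quantitative column) or the mode (qualitative column)."""
    observed = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


def evaluate(y_true, scores, threshold=0.5):
    """Compute sensitivity, specificity, accuracy, and AUC.

    Assumption: y_true uses 1 for the class treated as positive, and scores
    are predicted probabilities of that class.
    """
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
    tn = sum(p == 0 and y == 0 for p, y in zip(pred, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    # AUC as the probability that a random positive outscores a random
    # negative, with ties counted as one half (O(n^2), fine for a sketch).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "auc": wins / (len(pos) * len(neg)),
    }
```

Here "accuracy" corresponds to the rate of correctly classified observations. Unlike the three threshold-dependent measures, the AUC is computed from the ranking of the scores alone, which is one reason a model can show a high AUC alongside a modest sensitivity at a particular cutoff.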
Results: The three features with the highest importance were the NHIA (NAACCR Hispanic/Latino Identification Algorithm) label, the Hispanic/Spanish identity, and the year of Social Security number (SSN) issuance; the latter is part of a current SSN-based algorithm used to impute nativity status in this population. The classification models showed similar prediction performance. The RF achieved the highest AUC (0.9822), followed by Boosting (AUC = 0.9785) and the pruned classification tree (AUC = 0.9647). The Boosting and RF models had a similar rate of correctly classified observations (∼94%). The RF model ranked feature importance in a similar order to the pruned classification tree. Ultimately, the RF model was the best-performing classifier, with a sensitivity and specificity of 0.61 and 0.90, respectively. This classifier substantially outperforms the SSN-based algorithm, which had a correct classification rate of 81%, a sensitivity of 0.54, and a specificity of 0.41 in this dataset.
Conclusions: This proof-of-concept analysis showcases the potential for developing an ML model to overcome missing nativity status in H/L cancer patients and enable more accurate classification by nativity in future research using CCR data.
Citation Format: Proof-of-Concept: A machine learning classifier to impute nativity status in Hispanic/Latino/a/x cancer patients in the California Cancer Registry. Joel Sanchez Mendez, Lihua Liu, Juan P. Lewinger, Laura Fejerman, Mariana C. Stern [abstract]. In: Proceedings of the 17th AACR Conference on the Science of Cancer Health Disparities in Racial/Ethnic Minorities and the Medically Underserved; 2024 Sep 21-24; Los Angeles, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2024;33(9 Suppl):Abstract nr A008.