Applying computer text mining algorithms for oversampling tumor mutation status in medical records for the NCI Patterns of Care studies

Benmei Liu, Jennifer Stevens, Gary Beverungen, Michael T Halpern

doi:10.1016/j.ijmedinf.2023.105157

Abstract

Background: The National Cancer Institute (NCI) conducts Patterns of Care (POC) studies for selected cancer sites under a Congressional mandate. These studies aim to collect treatment information beyond what is typically collected by the NCI’s Surveillance, Epidemiology, and End Results (SEER) Program. The 2019 POC study focused on non-small cell lung cancer (NSCLC) and melanoma. For the NSCLC cases, one of the primary sampling objectives was to oversample patients who tested positive for EGFR/ALK mutations, but information on mutation test results was unavailable before the study sample was selected. Methods: To address this, text mining algorithms were developed to screen all eligible NSCLC cases from the SEER database. These algorithms were designed to identify mutation test status, allowing for stratified sampling based on SEER registry, sex, race/ethnicity, and tumor mutation test results. Results: The final NSCLC sample included 2,434 patients aged 20+ with advanced stage (IIIB-IVB) NSCLC diagnosed in 2017 and 2018. Among this sample, 692 cases (13.2%) tested positive for EGFR/ALK mutations. An evaluation of the text mining algorithms' performance, based on cases where both algorithm results and known EGFR/ALK status from medical chart abstraction were available, showed good performance: sensitivity of 77.6%, specificity of 90.8%, and an overall accuracy of 84.8%. Conclusions: The adaptation of text mining algorithms proved effective for oversampling patients with uncommon conditions in studies where electronic medical records are accessible. The 2019 POC study provides valuable data for researchers to evaluate cancer therapy details and patient characteristics, particularly among patients who tested positive for EGFR/ALK mutations.
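The screening and evaluation steps summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe the actual rules, so everything here is an assumption made for illustration: the keyword patterns, the function names `flag_positive` and `evaluate`, and the toy notes and labels are hypothetical and are not taken from the study.

```python
import re

# Hypothetical cue lists; the study's real text mining rules are not given here.
GENE = re.compile(r"\b(?:EGFR|ALK)\b", re.IGNORECASE)
NEG_CUE = re.compile(r"\b(?:negative|not detected|no mutation|wild[- ]?type)\b",
                     re.IGNORECASE)
POS_CUE = re.compile(r"\b(?:positive|detected|rearrangement)\b", re.IGNORECASE)


def flag_positive(note: str) -> bool:
    """Return True if any sentence mentions EGFR/ALK with a positive cue
    and no negative cue (a deliberately simple, sentence-level rule)."""
    for sentence in re.split(r"[.;\n]+", note):
        if not GENE.search(sentence):
            continue  # sentence does not mention either gene
        if NEG_CUE.search(sentence):
            continue  # negated or negative result, so not a positive flag
        if POS_CUE.search(sentence):
            return True
    return False


def evaluate(predicted, actual):
    """Compare algorithm flags with chart-abstracted status (both lists of bool)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return {
        "sensitivity": tp / (tp + fn),   # share of true positives that were flagged
        "specificity": tn / (tn + fp),   # share of true negatives left unflagged
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy notes and labels, invented purely to exercise the functions.
notes = [
    "EGFR exon 19 deletion detected.",
    "EGFR mutation not detected; ALK negative",
    "EGFR negative. ALK rearrangement positive.",
    "No molecular testing performed",
]
flags = [flag_positive(n) for n in notes]
metrics = evaluate(flags, [True, False, True, True])
```

In a sampling workflow of the kind described, flags like these would define the mutation-status stratum, and the metrics would be computed only for cases where chart abstraction later supplied the true status.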

Full Text