Abstract

The National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results (SEER) registries maintain and organize cancer incidence information, allowing researchers to derive valuable insights into cancer epidemiology. While significant attention has been devoted to identifying cancers either from clinical text or from tabular data collected by SEER registries, there has been less emphasis on integrating these two modes of data. In our multimodal deep learning approach, we use longitudinal tabular data from the Consolidated Tumor Case (CTC) database that encompass a patient’s past diagnoses. This tabular information can augment clinical text to aid in the classification of pathology reports indicative of recurrent cancers. Four NCI SEER registries (Louisiana, New Jersey, Seattle, and Utah) have manually labeled 61,150 pathology reports with one of six categories, which we refine into a four-class classification problem. Each pathology report is identified as positive for recurrence, negative for recurrence/not disease free, new tumor, or an “other” (no malignancy/uncertain) class. Natural language processing techniques can extract meaningful information from clinical pathology reports, aiding in the identification of subtle indicators of recurrence by using relevant context. We use a hierarchical self-attention model (HiSAN) to construct document embeddings and classify the pathology report. To further enhance the predictive accuracy of our modeling approach, we fuse the textual information from a pathology report with categorical data about the patient’s cancer history. For each report, we create a patient context vector that encapsulates tumor-level information from the patient’s previous cancer(s). The selected CTC records are associated with cancers diagnosed more than 120 days before the date of biospecimen collection stated in the pathology report.
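The record-selection rule above can be illustrated with a minimal sketch. The record layout (a dictionary with `tumor_id` and `diagnosis_date` fields) and the function name `select_prior_records` are hypothetical and are not taken from the CTC database schema; only the 120-day threshold comes from the text.

```python
from datetime import date

# Threshold stated in the abstract: diagnoses must precede specimen
# collection by more than 120 days.
LOOKBACK_DAYS = 120


def select_prior_records(records, collection_date, min_gap_days=LOOKBACK_DAYS):
    """Keep records diagnosed more than `min_gap_days` before collection.

    `records` is a list of dicts with a "diagnosis_date" field; this layout
    is an assumption made for illustration only.
    """
    return [
        r for r in records
        if (collection_date - r["diagnosis_date"]).days > min_gap_days
    ]


# Made-up example records for one patient.
records = [
    {"tumor_id": "A", "diagnosis_date": date(2019, 3, 1)},
    {"tumor_id": "B", "diagnosis_date": date(2021, 5, 20)},
    {"tumor_id": "C", "diagnosis_date": date(2021, 1, 1)},
]
selected = select_prior_records(records, collection_date=date(2021, 6, 1))
selected_ids = [r["tumor_id"] for r in selected]
print(selected_ids)
```

In this sketch a diagnosis made exactly 120 days before collection is excluded, following the "more than 120 days" wording.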
The patient context vector is built from diverse categorical features, including cancer staging, patient age, treatment, and sites of metastasis at the time of diagnosis. Features are represented using a combination of one-hot encoding and binning. Additionally, we employ patient-level and feature-level normalization to maintain the proportional significance of features for individuals with multiple past diagnoses. We present preliminary results for three approaches to classifying cancer recurrence. First, using only the pathology reports as input yields an accuracy of 68%. Second, using only CTC features with an XGBoost model yields an accuracy of 49%. Finally, we show that leveraging multiple data modalities, i.e., HiSAN-generated pathology report embeddings and CTC data, significantly improves the model’s predictive accuracy to 76%. This research demonstrates a promising path forward in enhancing the classification of clinical text by incorporating longitudinal patient history data.

Citation Format: Patrycja Krawczuk, Zachary Fox, Dakota Murdock, Jennifer Doherty, Antoinette Stroupe, Stephen M. Schwartz, Lynne Penberthy, Elizabeth Hsu, Serban Negoita, Valentina Petkov, Heidi Hanson. Multimodal machine learning for the automatic classification of recurrent cancers [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2318.
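As an illustration of the patient context vector described in the abstract, the sketch below one-hot encodes a stage category, bins patient age, and averages the encoded records so that patients with several past diagnoses remain on the same scale as patients with one. The category list, the age cut points, the field names, and the choice of averaging as the normalization are all assumptions made for this example; the abstract does not specify them.

```python
# Hypothetical stage categories and age cut points (not from the abstract).
STAGES = ["localized", "regional", "distant", "unknown"]
AGE_BIN_EDGES = [40, 60, 80]  # yields 4 bins: <40, 40-59, 60-79, >=80


def one_hot(value, categories):
    """Return a one-hot list marking the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def bin_index(value, edges):
    """Return the index of the first bin whose upper edge exceeds `value`."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def encode_record(record):
    """Encode one prior-diagnosis record as stage one-hot plus age-bin one-hot."""
    age_bins = [0.0] * (len(AGE_BIN_EDGES) + 1)
    age_bins[bin_index(record["age"], AGE_BIN_EDGES)] = 1.0
    return one_hot(record["stage"], STAGES) + age_bins


def patient_context_vector(records):
    """Average the encoded records (an assumed form of patient-level normalization)."""
    dim = len(STAGES) + len(AGE_BIN_EDGES) + 1
    if not records:
        return [0.0] * dim  # no qualifying history: all-zero context
    encoded = [encode_record(r) for r in records]
    n = len(encoded)
    return [sum(col) / n for col in zip(*encoded)]


# Made-up history with two prior diagnoses.
history = [
    {"stage": "localized", "age": 52},
    {"stage": "distant", "age": 61},
]
context = patient_context_vector(history)
print(context)
```

With averaging, each feature group sums to 1 regardless of how many prior diagnoses a patient has, which is one simple way to keep feature weights proportional across patients.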