Abstract
The Cancer Genome Atlas (TCGA) project of the NCI profiled more than 10,000 tumor samples over 10 years. As tissue-specific working groups reviewed the available data, these patient samples were separated into distinct molecular subtypes, and the resulting clusters were reported in various marker papers. While these assignments provided invaluable information about the common patterns of molecular characteristics in different types of cancer, there was no consistent methodology for assigning new samples to the defined molecular subtypes.
The NCI's Tumor Molecular Pathology group was formed to create machine learning-based models that could be applied to non-TCGA samples to determine their TCGA-mapped subtypes. Five modeling systems (JADBio; SKGrid, by the Oregon Health and Science University; CloudForest, by the Institute for Systems Biology; AKLIMATE, by the University of California Santa Cruz; and subSCOPE, by BC Cancer’s Genome Sciences Centre) were trained to recognize TCGA subtypes using multi-omic measurements from gene expression, DNA methylation, miRNA expression, copy number, and somatic mutation calls. Although the TCGA samples were profiled using multi-omic technologies, single-platform and/or compact feature set models were also assessed for their ability to assign these classifications. Each machine learning system created predictive models for 106 subtypes from 26 cancer types using as few features as possible, with a maximum of 100 features allowed for scored models. A set of 411,706 models was developed, comprising the results of each learning method across the various omic platforms. Top models, both multi-omic and single-platform, were selected for each cancer type. On average, models achieved an overall weighted F1 score of 0.895 with 42 features.
The top models for each cancer type achieved a mean overall weighted F1 score of 0.936 with a mean of 29 features; in 20 of the 26 cancer types, models using only gene expression provided the best performance. Analysis of the features selected by the models showed that some known oncogenic drivers were selected by many models, but different models often used features from different genes while reaching similar levels of performance. Network-level analysis revealed that many of the genes behind these selected features operate within the same pathways.
Transferability of these models to external datasets was tested by applying models trained on TCGA breast cancer data to the AURORA and METABRIC datasets. Despite the difference in data platform between TCGA (RNA-seq) and METABRIC (microarray), F1 values degraded only minimally in transfer. This set of models and the training dataset will provide new opportunities for researchers and translational scientists to connect new tumors to the subtypes seen in the TCGA cohorts.
Citation Format: Kyle Ellrott, Chris K. Wong, Christina Yau, Mauro A. Castro, Jordan Lee, Brian Karlberg, Jasleen K. Grewal, Vincenzo Lagani, Bahar Tercan, Verena Friedl, Toshinori Hinoue, Vladislav Uzunangelov, Lindsay Westlake, Xavier Loinaz, Ina Felau, Peggy Wang, Anab Kemal, Samantha J. Caesar-Johnson, Ilya Shmulevich, Alexander J. Lazar, Ioannis Tsamardinos, Katherine A. Hoadley, The Cancer Genome Atlas Analysis Network, Gordon A. Robertson, Theo A. Knijnenburg, Christopher C. Benz, Joshua M. Stuart, Jean C. Zenklusen, Andrew D. Cherniack, Peter W. Laird. Leveraging compact feature sets for TCGA-based molecular subtype classification on new samples [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 6548.