Abstract

Tumor purity is the percentage of cancer cells present in a sample of tumor tissue. The noncancerous cells (stromal cells) in a tumor are thought to have an important role in tumor growth, metastatic progression, and drug resistance. They also strongly influence genomic analyses of tumor samples. The Cancer Genome Atlas (TCGA) has extensive RNA-seq data from tumor tissue samples as well as assessments of tumor purity for the samples. Our goal is to select, for each tumor type, a subset of genes whose expression levels are predictive of tumor purity, as well as a subset of genes whose expression levels are predictive of tumor purity when samples from all tumor types are pooled together. We hope that the selected genes may provide insight into the cell-type composition of tumor samples and into the similarities and differences among tumor microenvironments. We used data from TCGA, which cover 11 different tumor types and include genome-wide gene expression assessments on over 3,148 samples. To identify predictive genes, we used XGBoost, a supervised machine learning algorithm based on a boosted ensemble of regression trees. We carried out 100 repeated runs of 10-fold cross-validation (a total of 1,000 train-test partitions) for each tumor type and for all tumor types combined. Using the training-set samples, XGBoost selects a set of genes to predict tumor purity levels; the selected genes are subsequently used to predict the purity levels of the test-set samples. Across the 1,000 train-test partitions for all 11 tumor types, the average root-mean-squared error for the test sets ranged from 0.09 to 0.16. For each tumor type, we selected the top 250 genes based on their aggregated feature importance scores, a measure of each gene's contribution to tumor purity estimation.
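The resampling and ranking procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, `fit_predict` is a placeholder callback standing in for whatever wraps the XGBoost model, and summing importance scores across partitions is an assumption, since the abstract does not say how the scores were aggregated.

```python
import math
import random
from collections import defaultdict


def repeated_kfold(n_samples, n_splits=10, n_repeats=100, seed=0):
    """Yield (train, test) index lists: n_repeats shuffles, n_splits folds each."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_splits):
            test = order[k::n_splits]  # every n_splits-th shuffled sample forms fold k
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def aggregate_importance(partitions, fit_predict):
    """Return (mean test RMSE, importance summed per gene over all partitions).

    fit_predict(train, test) is a hypothetical callback that fits a model on
    the training samples and returns (y_true, y_pred, {gene: importance})
    for the test samples.
    """
    totals = defaultdict(float)
    errors = []
    for train, test in partitions:
        y_true, y_pred, importance = fit_predict(train, test)
        errors.append(rmse(y_true, y_pred))
        for gene, score in importance.items():
            totals[gene] += score  # assumption: aggregate by simple summation
    return sum(errors) / len(errors), dict(totals)


def top_genes(totals, n=250):
    """Rank genes by aggregated importance and keep the n highest."""
    return sorted(totals, key=totals.get, reverse=True)[:n]
```

With the defaults, `repeated_kfold` produces the 1,000 train-test partitions mentioned in the text; running it once per tumor type and once on the pooled samples would mirror the design described, under the assumptions stated above.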
No single gene was among the top 250 in all 11 tumor types; however, ACAP1, AMICA1, CSF2RB, CYTIP, GGT5, GLIPR1, IRF4, and PECAM1 were among the top 250 in more than 6 tumor types and also among the top 250 when all tumor types were combined, suggesting that these genes might serve as biomarkers for tumor purity. The most common pathways identified by gene ontology analysis of these top genes include various immune and signaling pathways. In summary, we used XGBoost to identify genes whose expression levels were associated with tumor purity levels in each tumor type. Our results suggest that assessed tumor purity levels in tumor samples can be faithfully recapitulated using certain subsets of genes. We believe that the genes selected for each tumor type by our unbiased approach might provide insight into the biology of the tumor microenvironment; for example, the selection of cell type-specific marker genes would indicate the presence of specific cell types.

Citation Format: YuanYuan Li, Adrienna Bingham, Qi-Jing Li, Yuan Zhuang, David M. Umbach, Leping Li. Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 2255.
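The cross-tumor comparison reported in the abstract (genes in the top 250 of more than 6 tumor types and also in the pooled top 250) amounts to a simple counting step. The sketch below shows one way it might be done; the function name, argument names, and input layout are hypothetical and are not taken from the authors' analysis.

```python
from collections import Counter


def recurrent_genes(top_by_type, pooled_top, min_types=7):
    """Genes in the top list of at least min_types tumor types and in the pooled list.

    top_by_type: {tumor_type: iterable of top-ranked gene names} (hypothetical layout)
    pooled_top:  set of top-ranked genes from the all-tumor-types analysis
    min_types=7 corresponds to "more than 6 tumor types".
    """
    counts = Counter(
        gene for genes in top_by_type.values() for gene in set(genes)
    )
    return sorted(
        gene for gene, n in counts.items()
        if n >= min_types and gene in pooled_top
    )
```

Deduplicating each per-type list with `set` before counting ensures a gene is counted at most once per tumor type.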
