Abstract

Background: Tumor purity is the percent of cancer cells present in a sample of tumor tissue. The non-cancerous cells (immune cells, fibroblasts, etc.) have an important role in tumor biology. The ability to determine tumor purity is important to understand the roles of cancerous and non-cancerous cells in a tumor.

Methods: We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data.

Results: Across the 33 tumor types, the median correlation between observed and predicted tumor purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested whether our set of ten genes could accurately predict tumor purity of a TCGA-independent data set. We showed that expression levels from our set of ten genes were highly correlated (ρ = 0.88) with the actual observed tumor purity.

Conclusions: Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.

Highlights

  • Tumor purity is the percent of cancer cells present in a sample of tumor tissue

  • We showed that EXtreme Gradient Boosting (XGBoost) can accurately predict tumor purity values using gene expression data alone

  • For The Cancer Genome Atlas (TCGA) data and an independent set of non-TCGA samples, we showed that predictions based on the expression levels of only the top ten genes are almost as accurate as predictions based on all genes
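
The accuracy figures quoted above rest on two quantities: a correlation between observed and predicted tumor purity, and a root mean square error. The sketch below shows one way to compute both for a set of predictions. It assumes a rank (Spearman) correlation, which the symbol ρ suggests but the text does not state, and the purity values are hypothetical numbers chosen for illustration, not data from the study. The function names are likewise illustrative.

```python
import math
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


# Hypothetical purity fractions for five samples (illustrative only).
observed = [0.90, 0.55, 0.70, 0.30, 0.80]
predicted = [0.85, 0.60, 0.65, 0.40, 0.82]

rho = spearman(observed, predicted)
err = rmse(observed, predicted)
print(f"Spearman rho = {rho:.2f}, RMSE = {err:.3f}")
```

A rank correlation rewards getting the ordering of samples right, while the RMSE measures how far the predicted purity values are from the observed ones, so reporting both gives a fuller picture than either alone.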

Summary

Introduction

Tumor purity is the percent of cancer cells present in a sample of tumor tissue. The non-cancerous cells have an important role in tumor biology, so the ability to determine tumor purity is important for understanding the roles of cancerous and non-cancerous cells in a tumor. The tumor microenvironment consists of the non-cancerous stromal cells present in and around a tumor; these include immune cells, fibroblasts, cells that make up supporting blood vessels, and others. The Cancer Genome Atlas (TCGA) provided comprehensive datasets for more than 10,000 samples in more than 30 tumor types [3]. Those studies provide valuable information about genomic changes in tumor samples compared to normal samples.
