Abstract

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.

Highlights

  • A wealth of gene expression data is being made publicly available by consortia such as The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Network, 2012)

  • We developed Training Distribution Matching (TDM), a new method for normalizing RNA-seq data so that it can be used with machine learning models built on microarray data, for both prediction and clustering

  • TDM performed well compared to quantile normalization, nonparanormal transformation, and log2 transformation on a range of data



Introduction

A wealth of gene expression data is being made publicly available by consortia such as The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Network, 2012). Such large datasets provide the opportunity to discover signals in gene expression that may not be apparent with smaller sample sizes, such as prognostic indicators or predictive factors for subsets of patients. Machine learning approaches often construct a model that captures relevant features of a dataset, and the model can then be used to make predictions about new data, such as how well a patient will respond to a particular treatment (Geeleher, Cox & Huang, 2014) or whether their cancer is likely to recur (Kourou et al., 2014). The model is usually constructed using a large, diverse dataset and is then applied to incoming cases to make predictions about them.
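To make the cross-platform problem concrete, the sketch below applies two of the baseline transformations named in the abstract, a log2 transformation and quantile normalization against a reference distribution, to toy data. It is not TDM itself and does not reproduce the authors' implementation. The function names, the pseudocount of 1, and the toy values are all assumptions chosen for illustration, and the sketch assumes the reference and target samples cover the same number of genes.

```python
import math


def log2_transform(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) for each value in one sample.

    The pseudocount (an assumption here) avoids taking log2 of zero.
    """
    return [math.log2(c + pseudocount) for c in counts]


def reference_quantiles(reference_samples):
    """Average the sorted values of the reference samples, rank by rank."""
    sorted_samples = [sorted(sample) for sample in reference_samples]
    n_samples = len(sorted_samples)
    return [sum(values) / n_samples for values in zip(*sorted_samples)]


def quantile_normalize_to_reference(sample, ref_quantiles):
    """Replace each value with the reference quantile of the same rank.

    Ties are broken by position, which is a simplification.
    """
    if len(sample) != len(ref_quantiles):
        raise ValueError("sample and reference must have the same length")
    # Indices of the sample's values, ordered from smallest to largest.
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, gene_index in enumerate(order):
        normalized[gene_index] = ref_quantiles[rank]
    return normalized


# Hypothetical microarray training data: two samples, four genes, log2 scale.
microarray = [[2.0, 4.0, 6.0, 8.0], [3.0, 5.0, 7.0, 9.0]]
# Hypothetical RNA-seq read counts for the same four genes in one new sample.
rnaseq_counts = [1500, 0, 30, 700]

ref = reference_quantiles(microarray)
logged = log2_transform(rnaseq_counts)
matched = quantile_normalize_to_reference(logged, ref)
print(matched)
```

After this step the RNA-seq sample keeps its gene ranking but takes on the value distribution of the microarray training data, which is the property that lets a model trained on the reference platform be applied to the new sample.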

