Cross-platform normalization of microarray and RNA-seq data for machine learning applications.

Jeffrey A Thompson, Jie Tan, Casey S Greene

doi:10.7717/peerj.1621


Open Access

https://doi.org/10.7717/peerj.1621


Journal: PeerJ	Publication Date: Jan 21, 2016
Citations: 96	License type: CC BY 4.0

Affiliation: Dartmouth College, University of Pennsylvania

Abstract

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
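Two of the baseline transformations named above, the log2 transformation and quantile normalization, can be sketched in a few lines. The following is a minimal illustration only, not the TDM algorithm and not the API of the authors' R package. The function names, the +1 pseudocount, the tie handling, and the toy values are all assumptions made for this example.

```python
import math


def log2_transform(counts):
    """Return log2(count + 1) per gene; the +1 pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(c + 1) for c in counts]


def reference_distribution(training_samples):
    """Average the sorted values of the training samples rank by rank.

    Each sample is a list of expression values over the same genes.
    """
    sorted_samples = [sorted(s) for s in training_samples]
    n = len(sorted_samples)
    return [sum(vals) / n for vals in zip(*sorted_samples)]


def quantile_normalize(sample, reference):
    """Replace each value in `sample` with the reference value of the same rank.

    `reference` must be sorted ascending and have the same length as `sample`.
    Tied values receive the mean of the reference values they span.
    """
    if len(sample) != len(reference):
        raise ValueError("sample and reference must have the same length")
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    result = [0.0] * len(sample)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values tied with the one at `start`.
        end = start
        while end + 1 < len(order) and sample[order[end + 1]] == sample[order[start]]:
            end += 1
        mean_ref = sum(reference[start:end + 1]) / (end - start + 1)
        for k in range(start, end + 1):
            result[order[k]] = mean_ref
        start = end + 1
    return result


# Toy data (hypothetical): two microarray training samples and one RNA-seq sample.
microarray_training = [[2.0, 4.0, 6.0], [4.0, 6.0, 8.0]]
rna_seq_counts = [1023, 0, 15]

reference = reference_distribution(microarray_training)
normalized = quantile_normalize(log2_transform(rna_seq_counts), reference)
print(normalized)
```

In this sketch the RNA-seq sample keeps its gene ranking but takes on the value distribution of the microarray training data, which is the general idea behind applying a microarray-trained model to RNA-seq input.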

Highlights

A wealth of gene expression data is being made publicly available by consortia such as The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Network, 2012)
We developed Training Distribution Matching (TDM), a new method of RNA-seq data normalization intended for prediction using machine learning models built on microarray data and improved clustering
TDM performed well compared to quantile normalization, nonparanormal transformation, and log2 transformation on a range of data

Summary

Introduction

A wealth of gene expression data is being made publicly available by consortia such as The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Network, 2012). Such large datasets provide the opportunity to discover signals in gene expression that may not be apparent with smaller sample sizes, such as prognostic indicators or predictive factors for subsets of patients. Machine learning approaches often construct a model that captures relevant features of a dataset, and the model can then be used to make predictions about new data, such as how well a patient will respond to a particular treatment (Geeleher, Cox & Huang, 2014) or whether their cancer is likely to recur (Kourou et al., 2014). The model is usually constructed using a large, diverse dataset and is applied to incoming cases to make predictions about them.

