Abstract
Background
Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but they also provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data – critical first steps for any subsequent analysis.
Results
We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for the sparsity that arises from the heterogeneity intrinsic to multi-group studies. We developed Yet Another RNA Normalization software pipeline (YARN), which includes quality control and preprocessing, gene filtering, and normalization steps designed to facilitate downstream analysis of large, heterogeneous RNA-Seq data sets, and we demonstrate its use with data from the Genotype-Tissue Expression (GTEx) project.
Conclusions
An R package instantiating YARN is available at http://bioconductor.org/packages/yarn.
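As a rough illustration of how such a pipeline can be driven from R, the sketch below follows the general flow described above (quality control, tissue-group inspection, gene filtering, and group-aware normalization). The function names, arguments, and bundled skin example data set are assumptions based on my recollection of the yarn Bioconductor vignette and may differ between package versions.

# Minimal sketch of a YARN-style workflow; API details are assumed from the
# yarn vignette (checkMisAnnotation, checkTissuesToMerge, filterLowGenes,
# normalizeTissueAware) and should be checked against the installed version.
library(yarn)

data(skin)  # small GTEx-derived example ExpressionSet shipped with yarn (assumed)

# 1. Quality control: flag samples whose annotated sex disagrees with
#    expression of Y-chromosome genes.
checkMisAnnotation(skin, "GENDER", columnID = "chromosome_name",
                   controlGenes = "Y", legendName = "Sex")

# 2. Decide which related sub-tissues can be merged by comparing
#    major (SMTS) and detailed (SMTSD) tissue groupings.
checkTissuesToMerge(skin, "SMTS", "SMTSD")

# 3. Filter genes with consistently low expression within each tissue group.
skin_filtered <- filterLowGenes(skin, "SMTSD")

# 4. Tissue-aware normalization (smooth quantile normalization within groups).
skin_norm <- normalizeTissueAware(skin_filtered, "SMTSD",
                                  normalizationMethod = "qsmooth")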
Highlights
Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies
The Genotype-Tissue Expression (GTEx) project release version 6.0 sampled over 550 donors with phenotypic information, representing 9590 RNA-Seq assays performed across 54 conditions
In examining Principal Coordinate Analysis (PCoA) plots, we found that samples from the cerebellum and cerebellar hemisphere subregions were indistinguishable from each other, but both were clearly distinct from the other brain regions
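To make the PCoA check concrete, here is a minimal, self-contained sketch of the kind of ordination used to compare tissue subregions: classical multidimensional scaling (cmdscale) on Euclidean distances between log-transformed expression profiles, colored by tissue label. The toy matrix and tissue labels are invented for illustration; in the YARN workflow this would be run on the GTEx expression object rather than simulated data.

# PCoA sketch on a toy expression matrix (genes x samples);
# values and tissue labels are invented for illustration only.
set.seed(1)
expr   <- matrix(rpois(2000 * 12, lambda = 50), nrow = 2000, ncol = 12)
tissue <- factor(rep(c("Cerebellum", "Cerebellar hemisphere", "Cortex"), each = 4))

# Distances between samples on log2-transformed counts.
d <- dist(t(log2(expr + 1)))

# Classical multidimensional scaling (principal coordinate analysis).
pcoa <- cmdscale(d, k = 2)

plot(pcoa, col = as.integer(tissue), pch = 19,
     xlab = "PCoA 1", ylab = "PCoA 2",
     main = "Samples by tissue subregion")
legend("topright", legend = levels(tissue),
       col = seq_along(levels(tissue)), pch = 19)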
Summary
Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. Projects are now generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but they also provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data – critical first steps for any subsequent analysis. Many large cohort studies, including the Genotype-Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA), have generated transcriptomic data on large populations and across multiple tissues or conditions to study patterns of gene expression [8, 9]. GTEx release version 6.0 sampled over 550 donors with phenotypic information, representing 9590 RNA-Seq assays performed across 54 conditions.
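For readers who want to start from the same data, the yarn package provides a helper for pulling the GTEx gene-level expression data down as a Bioconductor ExpressionSet. The call below reflects my recollection of that helper (downloadGTEx) and its arguments, and the SMTSD column name is an assumption; both should be checked against the installed package and GTEx release.

# Sketch: obtain GTEx gene-level expression as an ExpressionSet via yarn.
# Function name, arguments, and phenotype column names are assumed and may
# differ between yarn and GTEx release versions.
library(yarn)
library(Biobase)

gtex <- downloadGTEx(type = "genes", file = "gtex_genes.rds")

# Quick look at the phenotype annotations attached to the object.
head(pData(gtex))
table(pData(gtex)$SMTSD)  # detailed tissue types (column name assumed)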