How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets.

Lucia Peixoto, Marcelo A Wood, Mathieu E Wimmer, Davide Risso, Ted Abel, Shane G Poplawski, Terence P Speed

doi:10.1093/nar/gkv736

Abstract

The sequencing of the full transcriptome (RNA-seq) has become the preferred choice for the measurement of genome-wide gene expression. Despite its widespread use, challenges remain in RNA-seq data analysis. One often-overlooked aspect is normalization. Although a variety of factors or ‘batch effects’ can contribute unwanted variation to the data, commonly used RNA-seq normalization methods only correct for sequencing depth. The study of gene expression is particularly problematic when it is influenced simultaneously by a variety of biological factors in addition to the one of interest. Using examples from experimental neuroscience, we show that batch effects can dominate the signal of interest, and that the choice of normalization method affects the power and reproducibility of the results. While commonly used global normalization methods are not able to adequately normalize the data, more recently developed RNA-seq normalization methods can. We focus on one particular method, RUVSeq, and show that it is able to increase the power and biological insight of the results. Finally, we provide a tutorial outlining the implementation of RUVSeq normalization that is applicable to a broad range of studies as well as to meta-analysis of publicly available data.

Highlights

The sequencing of the full transcriptome (RNA-seq) has become the preferred choice for the measurement of genome-wide gene expression
The protocol was repeated over the course of 2 weeks to obtain 5 animals (2 hippocampi) per group: fear conditioning (FC), retrieval of the memory (RT) and corresponding controls (CC). Each animal represents an independent FC experiment, so that all animals for each group were dissected at the same time of day on different days
To further investigate how normalization of RNA-seq affects the detection of differential expression in the brain, we focused on long-term memory formation, since learning and memory paradigms are problematic (Figure 1)

Summary

Introduction

The sequencing of the full transcriptome (RNA-seq) has become the preferred choice for the measurement of genome-wide gene expression. One often-overlooked aspect is normalization, which is the transformation of values that allows comparisons between samples in a way that eliminates the effects of sources of variability that are not of interest. We refer to those effects as ‘unwanted variation’. A variety of technical and biological factors, collectively known as ‘batch effects’, contribute unwanted variation to genome-wide gene expression data. These factors include differences in amount of RNA, library preparation, equipment, operators, and procedures for sample extraction, preservation, or storage. Commonly used methods for RNA-seq normalization, such as upper quartile scaling (UQ) (2), trimmed mean of M values (TMM) (4) and FPKM (5), account only
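
To make the idea of a global scaling normalization concrete, the following is a minimal Python sketch of upper quartile scaling on a toy count matrix. It is an illustrative simplification, not the implementation used by the authors or by any particular package: the function name `upper_quartile_scale`, the toy data, and the choice to rescale to the mean upper quartile are all assumptions made for this example.

```python
from statistics import quantiles


def upper_quartile_scale(counts):
    """Scale each sample (column) so all samples share the same upper quartile.

    counts: list of rows, one row per gene, one column per sample.
    Returns a new matrix of floats with the same shape.
    Hypothetical helper for illustration only.
    """
    n_samples = len(counts[0])
    # Ignore genes with zero counts in every sample when computing quartiles.
    expressed = [row for row in counts if any(v > 0 for v in row)]
    # 75th percentile of each sample's counts.
    uq = []
    for j in range(n_samples):
        column = [row[j] for row in expressed]
        uq.append(quantiles(column, n=4, method="inclusive")[2])
    if any(q == 0 for q in uq):
        raise ValueError("a sample has an upper quartile of zero")
    # Assumption: rescale to the mean upper quartile to keep a count-like scale.
    target = sum(uq) / n_samples
    factors = [target / q for q in uq]
    return [[v * f for v, f in zip(row, factors)] for row in counts]


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [0, 0],
    [40, 80],
    [100, 200],
    [30, 60],
    [5, 10],
]
normalized = upper_quartile_scale(toy_counts)
for row in normalized:
    print(row)
```

In this toy case the two samples differ only by sequencing depth, so a single scaling factor per sample makes them identical. Because each sample is multiplied by one number, a factor that affects only a subset of genes would not be removed by this kind of scaling.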

Methods

Results

Conclusion