A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data.

Xiaohong Li, Liz O’Brien, Eric C Rouchka, Timothy E O’Toole, Guy N Brock, Abdallah M Eteleeb, Nigel G F Cooper, Dongfeng Wu, Shesh N Rai, Ryan S Gill

doi:10.1371/journal.pone.0176185

Open Access

Abstract

Normalization is an essential step with considerable impact on high-throughput RNA sequencing (RNA-seq) data analysis. Although there are numerous methods for read count normalization, choosing an optimal method remains a challenge because multiple factors contribute to read count variability, which affects overall sensitivity and specificity. To determine the most appropriate normalization methods, it is critical to compare the performance and shortcomings of a representative set of normalization routines across different dataset characteristics. We therefore evaluated the commonly used methods (DESeq, TMM-edgeR, FPKM-CuffDiff, TC, Med, UQ and FQ) and two new methods we propose: Med-pgQ2 and UQ-pgQ2 (per-gene normalization after per-sample median or upper-quartile global scaling). Our per-gene normalization approach allows for comparisons between conditions based on similar count levels. Using the benchmark Microarray Quality Control Project (MAQC) and simulated datasets, we performed differential gene expression analysis to evaluate these methods. When evaluating MAQC2 with two replicates, we observed that Med-pgQ2 and UQ-pgQ2 achieved a slightly higher area under the Receiver Operating Characteristic curve (AUC), a specificity rate > 85%, a detection power > 92% and an actual false discovery rate (FDR) under 0.06 given the nominal FDR (≤0.05). Although the most commonly used methods (DESeq and TMM-edgeR) yield a higher power (>93%) for MAQC2 data, they do so at the cost of a reduced specificity (<70%) and a slightly higher actual FDR than our proposed methods. In addition, an analysis based on the qualitative characteristics of the sample distributions for the MAQC2 and human breast cancer datasets shows that only our gene-wise normalization methods corrected data skewed towards lower read counts. However, when we evaluated MAQC3, which has five replicates and less variation, all methods performed similarly. Thus, our proposed Med-pgQ2 and UQ-pgQ2 methods perform slightly better for differential gene analysis of RNA-seq data that are skewed towards lowly expressed read counts with high variation: they improve specificity while maintaining good detection power and controlling the nominal FDR level.
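
The evaluation criteria quoted above (detection power, specificity and actual FDR) all derive from the same confusion counts. The following is a minimal sketch of how they relate, assuming each gene carries a boolean DEG call from a method and a boolean benchmark truth label; the function name `de_metrics` and the input layout are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, Sequence


def de_metrics(called: Sequence[bool], truth: Sequence[bool]) -> Dict[str, float]:
    """Summarize DEG calls against benchmark truth labels.

    called[i] is True if gene i was declared differentially expressed;
    truth[i] is True if gene i is a true DEG in the benchmark.
    (Hypothetical helper for illustration only.)
    """
    if len(called) != len(truth):
        raise ValueError("called and truth must have the same length")

    tp = sum(1 for c, t in zip(called, truth) if c and t)          # true positives
    fp = sum(1 for c, t in zip(called, truth) if c and not t)      # false positives
    fn = sum(1 for c, t in zip(called, truth) if not c and t)      # false negatives
    tn = sum(1 for c, t in zip(called, truth) if not c and not t)  # true negatives

    return {
        # Power (sensitivity): fraction of true DEGs that were detected.
        "power": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of non-DEGs correctly left uncalled.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Actual FDR: fraction of calls that are false; 0 if nothing was called.
        "actual_fdr": fp / (tp + fp) if (tp + fp) else 0.0,
    }
```

A method controls the FDR in this sense when `actual_fdr` stays close to or below the nominal level used to make the calls (0.05 in the abstract), while power and specificity describe the two sides of the trade-off reported for DESeq and TMM-edgeR.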

Highlights

High-throughput RNA sequencing (RNA-seq) has become the preferred choice for gene expression studies due to technological advances allowing for increased transcriptome coverage and reduced cost.
We observed that the Full Quantile (FQ) and FPKM methods greatly increased the intra-condition variation compared to the un-normalized data and the other normalization methods (Fig 1B).
Total Counts (TC), FPKM, RPKM and FQ are not recommended for the analysis of differentially expressed genes (DEGs) because of several issues: TC handles lowly expressed genes poorly, FPKM and RPKM introduce a length correction bias, and FQ can increase the intra-condition variation by forcing all samples to have identical distributions [18,20,22,23].
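
To make the last point concrete, the sketch below shows a textbook form of full quantile normalization: each sample's sorted counts are replaced by the mean of the sorted counts across samples, so every sample ends up with exactly the same set of values. It is a simplified illustration under stated assumptions (a genes-by-samples matrix, ties broken by position rather than averaged) and is not necessarily the FQ implementation evaluated in the paper; the function name `full_quantile_normalize` is hypothetical.

```python
import numpy as np


def full_quantile_normalize(counts):
    """Textbook full quantile normalization of a genes x samples matrix.

    Simplification: tied counts within a sample are ranked by position
    instead of receiving an averaged reference value.
    """
    counts = np.asarray(counts, dtype=float)

    # Rank genes within each sample (column).
    order = np.argsort(counts, axis=0, kind="stable")
    sorted_counts = np.take_along_axis(counts, order, axis=0)

    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_counts.mean(axis=1)

    # Write the reference value for rank k back to the gene holding rank k.
    normalized = np.empty_like(counts)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized
```

After this transformation the sorted values of every column are identical, which is the "identical distributions" property described above: any genuine between-sample differences in distribution shape are removed along with the technical ones.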

Introduction

High-throughput RNA sequencing (RNA-seq) has become the preferred choice for gene expression studies due to technological advances allowing for increased transcriptome coverage and reduced cost. These improvements have enabled studies with a large range of applications, including identification of alternative splicing isoforms [1,2,3], de novo transcript assembly to identify novel genes and isoforms [4,5,6], detection of single-nucleotide polymorphisms (SNPs) [7,8] and novel single nucleotide variants (SNVs) [9], and characterization of mRNA editing [10]. Several sequencing platforms exist, which require similar sample pre-processing and subsequent analytical steps, as summarized by Zhang et al. [14]. This RNA-seq workflow consists of three major steps: 1) RNA-seq library construction; 2) sequencing and mapping; and 3) normalization and statistical modeling to identify differentially expressed genes (DEGs) or transcript isoforms. Normalization is a crucial step in gene expression studies for both microarray and RNA-seq data [16,17,18,19].
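
As an illustration of what the normalization step can involve, the following is a minimal sketch of per-sample global scaling in the style of the TC, Med and UQ methods: each sample is divided by a single factor derived from its total, median or upper-quartile count. The details here are assumptions for illustration (genes with zero counts in every sample are excluded when computing the statistic, and factors are rescaled to average 1 so the counts stay on their original scale); the function name `global_scale` is hypothetical and the paper's exact definitions may differ.

```python
import numpy as np


def global_scale(counts, method="UQ"):
    """Per-sample global scaling of a genes x samples count matrix.

    method: "TC" (total counts), "Med" (median) or "UQ" (upper quartile).
    Returns (scaled_counts, scaling_factors).
    """
    counts = np.asarray(counts, dtype=float)

    # Assumption: ignore genes with no reads in any sample when
    # computing the per-sample summary statistic.
    expressed = counts.sum(axis=1) > 0
    kept = counts[expressed]

    if method == "TC":
        stat = kept.sum(axis=0)
    elif method == "Med":
        stat = np.median(kept, axis=0)
    elif method == "UQ":
        stat = np.percentile(kept, 75, axis=0)
    else:
        raise ValueError(f"unknown method: {method!r}")

    if np.any(stat == 0):
        raise ValueError("a sample has a zero scaling statistic")

    # Rescale so the factors average to 1 across samples.
    factors = stat / stat.mean()
    return counts / factors, factors
```

For example, if one sample was sequenced at exactly twice the depth of another, all three choices of `method` yield factors in a 1:2 ratio and bring the two samples onto the same scale. The methods diverge when a few highly expressed genes dominate the totals, which is one reason the median and upper quartile are often preferred over the total count.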

Methods

Results

Conclusion