Abstract

Data normalization is a critical step in RNA sequencing (RNA-seq) analysis, aiming to remove systematic effects from the data to ensure that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. This stochastic sample-specific effect is not corrected by common normalization methods, including reads per kilobase of transcript length per million reads (RPKM), Trimmed Mean of M values (TMM), relative log expression (RLE), and quantile and upper-quartile normalization. Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent functional misinterpretation of the data. Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix genes) are particularly prone to such false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. Consequently, using these normalization methods led to a substantial reduction in false GSEA results while retaining true ones. In addition, we found that applying gene-set tests that take gene–gene correlations into account attenuates the false positive rates caused by the length bias, but also reduces statistical power. Our results advocate the inspection and correction of sample-specific length biases as default steps in RNA-seq analysis pipelines and reiterate the need to account for intergene correlations when performing gene-set enrichment tests, to reduce misinterpretation of transcriptomic data.

Highlights

  • We analyzed the original gene-level summaries as produced by the authors of these 35 datasets and found similar results, further precluding the possibility that the unexpected link we observed between gene length and FC is caused by any specific data-processing pipeline or any flaw in the analysis

  • As gene length is associated with biological function (e.g., extracellular matrix (ECM) genes, like collagens and integrins, are notably long, whereas housekeeping genes are markedly short [27]), we suspected that the technical coupling that we observed between gene length and differential expression would result in gene-set enrichment analysis (GSEA) false

  • Data underlying the results presented in this figure are provided in S3 Data. cqn, conditional quantile normalization; ECM, extracellular matrix; EMT, epithelial–mesenchymal transition; FDR, false discovery rate; GO, Gene Ontology; GSEA, gene-set enrichment analysis; NES, normalized enrichment score; RNA-seq, RNA sequencing; RPKM, reads per kilobase of transcript length per million reads

Introduction

The ranked gene list is tested against a large number of curated gene sets, seeking those whose genes are significantly concentrated at either end of the expression list (the two ends represent induced and repressed genes, respectively). This powerful method builds on the amplification of weak signals, achieved by considering the coordinated response of many genes that function in the same process, most of which individually show only a mild change in expression that does not reach statistical significance in per-gene tests. This increased sensitivity makes GSEA tests especially susceptible to false positive calls that stem from mild experimental artifacts. As gene length is associated with biological function (e.g., ECM genes, like collagens and integrins, are notably long, whereas housekeeping genes are markedly short [27]), we suspected that the technical coupling that we observed between gene length and differential expression would result in GSEA false
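
To make the idea of a gene set being "concentrated at either end" of a ranked list concrete, the following is a minimal sketch of an unweighted, Kolmogorov-Smirnov-style running-sum score. It is an illustration only, not the authors' pipeline and not the full GSEA procedure (which, in its standard form, also weights steps by the ranking statistic and assesses significance by permutation). The function name `enrichment_score` and the toy gene identifiers are hypothetical.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str], gene_set: Set[str]) -> float:
    """Unweighted running-sum enrichment score (illustrative sketch).

    ranked_genes is assumed to be ordered from most induced to most repressed.
    Walking down the list, the running sum rises at each gene in the set and
    falls at each gene outside it. The returned value is the running sum at
    its largest absolute deviation from zero: positive when the set
    concentrates near the induced end, negative near the repressed end.
    """
    n_total = len(ranked_genes)
    n_hits = sum(1 for gene in ranked_genes if gene in gene_set)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    # Step sizes are chosen so the running sum starts and ends at zero.
    hit_step = 1.0 / n_hits
    miss_step = 1.0 / (n_total - n_hits)

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit_step if gene in gene_set else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list of ten hypothetical genes, most induced first.
ranked = [f"g{i}" for i in range(1, 11)]
top_score = enrichment_score(ranked, {"g1", "g2", "g3"})      # set at the induced end
bottom_score = enrichment_score(ranked, {"g8", "g9", "g10"})  # set at the repressed end
```

Because the score depends only on where the set's genes fall in the ranking, any technical factor that shifts the ranking coherently for a whole set can produce a large score. If fold-change estimates are coupled to gene length, for instance, a set made up mostly of short genes would drift toward one end of the list regardless of biology.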
