Abstract
Deep sequencing of transcriptomes has become an indispensable tool for biology, enabling expression levels for thousands of genes to be compared across multiple samples. Since transcript counts scale with sequencing depth, counts from different samples must be normalized to a common scale prior to comparison. We analyzed fifteen existing and novel algorithms for normalizing transcript counts, and evaluated the effectiveness of the resulting normalizations. For this purpose we defined two novel and mutually independent metrics: (1) the number of “uniform” genes (genes whose normalized expression levels have a sufficiently low coefficient of variation), and (2) low Spearman correlation between normalized expression profiles of gene pairs. We also defined four novel algorithms, one of which explicitly maximizes the number of uniform genes, and compared the performance of all fifteen algorithms. The two most commonly used methods (scaling to a fixed total value, or equalizing the expression of certain ‘housekeeping’ genes) yielded particularly poor results, surpassed even by normalization based on randomly selected gene sets. Conversely, seven of the algorithms approached what appears to be optimal normalization. Three of these algorithms rely on the identification of “ubiquitous” genes: genes expressed in all the samples studied, but never at very high or very low levels. We demonstrate that these include a “core” of genes expressed in many tissues in a mutually consistent pattern, which is suitable for use as an internal normalization guide. The new methods yield robustly normalized expression values, which is a prerequisite for the identification of differentially expressed and tissue-specific genes as potential biomarkers.
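As an illustration of the first metric, the following minimal Python sketch counts "uniform" genes in a genes-by-samples count matrix, before and after a simple normalization. It is not the implementation evaluated in this work. The function names, the `target` total and the `cv_cutoff` threshold of 0.25 are hypothetical choices made for this example, since the abstract does not state the cutoff used. Scaling to a fixed total is used here only because it is the simplest method to write down, even though it is one of the methods reported above as performing poorly on real data.

```python
import numpy as np


def scale_to_total(counts, target=1e6):
    """Scale each sample (column) so that its counts sum to `target`.

    `target` is an arbitrary illustrative value, not one taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * target


def count_uniform_genes(expression, cv_cutoff=0.25):
    """Count genes (rows) whose coefficient of variation across samples
    (standard deviation divided by mean) is below `cv_cutoff`.

    `cv_cutoff` is an assumed threshold chosen for this sketch.
    Genes with zero mean expression are never counted as uniform.
    """
    expression = np.asarray(expression, dtype=float)
    mean = expression.mean(axis=1)
    std = expression.std(axis=1)
    cv = np.full(mean.shape, np.inf)
    expressed = mean > 0
    cv[expressed] = std[expressed] / mean[expressed]
    return int((cv < cv_cutoff).sum())


# Toy data: 3 genes x 2 samples, where the second sample was
# sequenced at twice the depth of the first.
toy_counts = np.array([[10, 20],
                       [30, 60],
                       [60, 120]])

uniform_raw = count_uniform_genes(toy_counts)
uniform_scaled = count_uniform_genes(scale_to_total(toy_counts))
```

In this toy case the raw counts differ between samples only because of sequencing depth, so no gene passes the cutoff before scaling and all three pass after it. Under this metric, a normalization that yields more uniform genes on the same data is considered more effective.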
Highlights
Modern sequencing technologies have enabled measurement of gene expression by “digital transcript counting”
We present insights derived from the study of many normalization algorithms, including novel methods, with an emphasis on data driven approaches
We present a framework for evaluating how successful normalization methods are at rendering the gene expression levels comparable across samples, without relying on specific applications
Summary
Modern sequencing technologies have enabled measurement of gene expression by “digital transcript counting”. Transcript counting has a number of compelling advantages, including high sensitivity and the ability to discover previously unknown transcripts [1]. Transcript counting started with the original Serial Analysis of Gene Expression method (SAGE) [2], gained momentum with Massively Parallel Signature Sequencing (MPSS) [3] and is coming to maturity with the application of “next generation” high throughput sequencing technologies. The modern RNA-seq technology [4] sequences the full extent of each transcript, and has the added advantage of being able to characterize alternative splice forms for the same gene [5]. Alternative splicing can be cell type-specific, tissue-specific, sex-specific and lineage-specific [6]. SAGE-Seq [7] applies next-generation sequencing to obtain SAGE-like data.