Abstract
BackgroundRNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods.ResultsDifferential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices.ConclusionsThis work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates.
Highlights
RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen
Comparisons of statistical methods: edgeR, DESeq, and NBPSeq using simulated data under the null To test the performance of each package under the null hypothesis, we simulated sets of n “control” and n “treatment” lanes of counts in accordance with the procedure described in the Methods section, for a range of values of n and with no differential expression (DE) between treatments
Shown are false positive rate (FPR) for high-count transcripts (> 100 counts averaged across biological replicates) and low-count transcripts (≤ 100 counts averaged across biological replicates)
Summary
RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample These strategies impact on the power of the approach to accurately identify differential expression. RNA-Seq analysis is vulnerable to the general biases and errors inherent in the next-generation sequencing (NGS) technology upon which it is based These errors and biases include: sequencing errors (wrong base calls), biases in sequence quality, nucleotide composition and error rates relative to the base position in the read [9,10], variability in sequence depth across the transcriptome due to preferential sites of fragmentation, variable primer and transcript nucleotide composition effects [11] and differences in the coverage and composition of raw sequence data generated from technical and biological replicate samples [1,12]
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have