Abstract

Background

Long non-coding RNAs (lncRNAs) are typically expressed at low levels and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 25 pipelines for testing DE in RNA-seq data is comprehensively evaluated, with a particular focus on lncRNAs and low-abundance mRNAs. Fifteen performance metrics are used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets.

Results

Gene expression data are simulated using non-parametric procedures in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, results for mRNA and lncRNA were tracked separately. All the pipelines exhibit inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and benchmark RNA-seq datasets. The substandard performance of DE tools for lncRNAs applies also to low-abundance mRNAs. No single tool uniformly outperformed the others. Variability, number of samples, and fraction of DE genes markedly influenced DE tool performance.

Conclusions

Overall, linear modeling with empirical Bayes moderation (limma) and a non-parametric approach (SAMSeq) showed good control of the false discovery rate and reasonable sensitivity. Of note, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic settings such as in clinical cancer research. About half of the methods showed a substantial excess of false discoveries, making these methods unreliable for DE analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, giving guidance on selection of the optimal DE tool (http://statapps.ugent.be/tools/AppDGE/).
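The abstract's two headline metrics, false discovery rate and sensitivity, can be illustrated with a minimal sketch for a single simulated dataset in which the truly DE genes are known. This is not the study's code (the tools it evaluates are R packages). The function name, the toy inputs, and the 0.05 threshold are assumptions made for illustration. Strictly, the first value returned is the false discovery proportion for one dataset; averaging it over many simulated datasets estimates the false discovery rate.

```python
from typing import Sequence, Tuple


def fdp_and_sensitivity(is_true_de: Sequence[bool],
                        adjusted_p: Sequence[float],
                        alpha: float = 0.05) -> Tuple[float, float]:
    """Return (false discovery proportion, sensitivity) for one dataset.

    is_true_de : ground-truth DE status per gene (known in a simulation)
    adjusted_p : multiplicity-adjusted p-values reported by a DE tool
    alpha      : nominal FDR threshold (0.05 is an assumed default)
    """
    if len(is_true_de) != len(adjusted_p):
        raise ValueError("inputs must have one entry per gene")

    # A gene is "called" DE when its adjusted p-value is below the threshold.
    called = [p < alpha for p in adjusted_p]

    tp = sum(c and t for c, t in zip(called, is_true_de))          # true positives
    fp = sum(c and not t for c, t in zip(called, is_true_de))      # false positives
    fn = sum((not c) and t for c, t in zip(called, is_true_de))    # false negatives

    # Define both ratios as 0.0 when their denominator is empty.
    fdp = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return fdp, sensitivity


# Toy example (made-up values): six genes, three of them truly DE.
truth = [True, True, True, False, False, False]
padj = [0.01, 0.03, 0.20, 0.04, 0.50, 0.90]
fdp, sens = fdp_and_sensitivity(truth, padj)
print(f"FDP = {fdp:.3f}, sensitivity = {sens:.3f}")
```

Because the study tracks mRNA and lncRNA results separately, the same function could be applied to each gene subset in turn.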

Highlights

  • Long non-coding ribonucleic acids (RNAs) are typically expressed at low levels and are inherently highly variable

  • The discovery and study of long non-coding RNAs (lncRNAs) is of major relevance to human health and disease because lncRNAs represent an extensive, largely unexplored, and functional component of the genome [3, 4, 5]

  • Several gene expression studies indicated that the expression of the majority of lncRNAs is characterized by low abundance [2, 7, 9], high noise [8], and tissue-specific expression [7]

Summary

Introduction

Long non-coding RNAs (lncRNAs) are typically expressed at low levels and are inherently highly variable. Attention is expanding to one of the most poorly understood, yet most common RNA species: lncRNAs [1, 2]. These lncRNAs [...] which is a characteristic shared with low-count data from massively parallel RNA sequencing. We evaluated and compared the performance of many popular statistical methods (Table 1) developed for testing differential gene expression (DGE) in RNA-seq data (hereafter referred to as “DE tools”), with special emphasis on lncRNAs and low-abundance mRNAs. All tools considered in this study are popular (in terms of number of citations), available as R software packages [12], and use gene- or transcript-level read counts as input. Our conclusions are based on six RNA-seq datasets and many realistic simulations, representing various typical gene expression experiments. (Source: Assefa et al., Genome Biology (2018) 19:96.)

