Abstract

Introduction: Long non-coding RNAs (lncRNAs) have emerged as important regulators of various biological processes. Many lncRNAs with tumor-suppressor or oncogenic functions in cancer have been discovered. While many studies have exploited public resources such as RNA-Seq data in The Cancer Genome Atlas (TCGA) to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate quantification of lncRNA expression. Multiple tools for processing RNA-Seq data have emerged in recent years; however, there is not yet an accepted gold-standard pipeline for optimal quantification of lncRNAs. Therefore, we aim to evaluate the performance of popular RNA-Seq analysis tools and recommend the best practice for RNA-Seq analysis of lncRNAs.

Methods: In this benchmarking study, we compared the performance of the pseudoalignment methods Kallisto and Salmon and the alignment-based methods HTSeq, featureCounts, and RSEM by applying them to a simulated RNA-Seq dataset with 63 samples and a pan-cancer RNA-Seq dataset with 210 samples from TCGA. GENCODE release 25 was used as the transcriptome reference. All scripts are available on GitHub (zhengh42/RNASeq_pipeline).

Results: The pseudoalignment methods Kallisto and Salmon detect more lncRNAs than the alignment-based methods and correlate highly with the simulated ground truth. In contrast, alignment-based methods tend to underestimate lncRNA expression or even fail to capture lncRNA signal that is present in the ground truth. These underestimated genes include several cancer-relevant lncRNAs such as TERC and ZEB2-AS1. Besides their high concordance with ground truth, pseudoalignment methods take less CPU time per sample. They are also flexible, supporting both gene-level and transcript-level quantification, while HTSeq and featureCounts are suitable for gene-level but not transcript-level analysis. Overall, 10-16% of lncRNAs can be detected in the samples, with antisense RNAs and lincRNAs the two most abundant categories. A higher proportion of antisense RNAs are detected than lincRNAs. Moreover, among the expressed lncRNAs, more antisense RNAs than lincRNAs are discordant from ground truth (Spearman's correlation less than 0.7) when measured by alignment-based methods, indicating that antisense RNAs are more susceptible to mis-quantification. In addition, lncRNAs with fewer transcripts, fewer than three exons, and lower sequence uniqueness tend to be more discordant. Finally, incomplete annotation leads to overestimated expression of both lncRNAs and protein-coding genes. Full transcriptome annotation, including both protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification.

Conclusions: In summary, considering the concordance with ground truth, the flexibility for both gene-level and transcript-level analysis, and the running time, the pseudoalignment methods Kallisto or Salmon in combination with the full transcriptome annotation are our recommended strategy for RNA-Seq analysis of lncRNAs.

Citation Format: Hong Zheng, Mikel Hernaez, Kevin Brennan, Olivier Gevaert. Benchmark of lncRNA quantification in RNA-Seq of cancer samples [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 2472.
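
The discordance criterion used in the Results (Spearman's correlation with ground truth below 0.7) can be illustrated with a short sketch. This is not taken from the authors' pipeline: the function names, the toy expression values, and the gene labels `lncRNA_A` and `lncRNA_B` are hypothetical, and treating a gene with constant estimates (for example, all zeros) as discordant is an assumption made here for illustration. Only the Python standard library is used.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks.

    Returns None when rho is undefined (one input is constant).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def discordant_genes(estimated, truth, threshold=0.7):
    """List genes whose per-sample estimates correlate with truth below threshold.

    Assumption: an undefined correlation (constant estimates) counts as discordant.
    """
    flagged = []
    for gene, true_values in truth.items():
        rho = spearman(estimated[gene], true_values)
        if rho is None or rho < threshold:
            flagged.append(gene)
    return flagged


# Hypothetical toy data: expression of two genes across five samples.
truth = {
    "lncRNA_A": [5.0, 1.0, 3.0, 2.0, 4.0],
    "lncRNA_B": [2.0, 8.0, 4.0, 6.0, 1.0],
}
estimated = {
    "lncRNA_A": [4.8, 1.2, 3.1, 2.2, 3.9],  # preserves the sample ordering
    "lncRNA_B": [0.0, 0.0, 0.0, 0.0, 0.0],  # signal not captured at all
}
print(discordant_genes(estimated, truth))
```

In practice a library routine such as `scipy.stats.spearmanr` would likely replace the hand-written ranking; the explicit version is shown only to make the criterion concrete.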
