Abstract
Background: Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification.
Results: In this study, we compared the performance of the pseudoalignment methods Kallisto and Salmon, the alignment-based transcript quantification method RSEM, and the alignment-based gene quantification methods HTSeq and featureCounts, in combination with the read aligners STAR, Subread, and HISAT2, for lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification in both sample-level and gene-level comparisons, regardless of RNA-Seq protocol type, choice of aligner, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. In contrast, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods; their quantification can be improved by using stranded protocols and pseudoalignment methods.
Conclusions: Considering consistency with ground truth and computational resources, using the pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis of lncRNAs.
Highlights
Long non-coding RNAs (lncRNAs) are a diverse class of RNA molecules that are more than 200 nucleotides in length and do not encode proteins [1].
We demonstrate that using full transcriptome annotation in RNA sequencing (RNA-Seq) analysis is strongly recommended, as it greatly improves the specificity of lncRNA quantification.
Notably, the default workflow for The Cancer Genome Atlas (TCGA) RNA-Seq data stored in the Genomic Data Commons (GDC) data portal uses HTSeq, an alignment-based method.
Summary
Long non-coding RNAs (lncRNAs) are a diverse class of RNA molecules that are more than 200 nucleotides in length and do not encode proteins [1]. LncRNAs have recently emerged as an essential class of regulatory elements for many biological processes, including imprinting, cell differentiation, and development [3]. They are often disrupted in human diseases, including cancer [4]. They may interact with DNA, RNA, and proteins, and exert regulatory roles through a variety of mechanisms. Based on their molecular functions, lncRNAs may act as i) signals, which are indicators of transcriptional activity; ii) decoys, which bind to and titrate away protein targets such as transcription factors; iii) guides, which direct regulatory complexes or transcription factors to specific targets and regulate gene expression in cis or trans; and iv) scaffolds, which serve as central platforms where relevant molecular components in cells are assembled [5]. The lincRNA MEG3 inhibits cell proliferation by downregulating MDM2 and promoting p53 accumulation [11].