Abstract

The study of RNA expression is the fastest growing area of genomic research. However, despite the dramatic increase in the number of sequenced transcriptomes, we still do not have accurate estimates of the number and expression levels of non-coding RNA genes. Non-coding transcripts are often overlooked due to incomplete genome annotation. In this study, we use annotation-independent detection of RNA reads generated using a reverse transcriptase with low structure bias to identify non-coding RNA. Transcripts between 20 and 500 nucleotides in length were filtered and cross-checked against non-coding RNA annotations, revealing 111 non-annotated non-coding RNAs expressed in different cell lines and tissues. Inspection of the sequence and structural features of these transcripts indicated that 60% of them correspond to new snoRNA and tRNA-like genes. The identified genes exhibited features of their respective families in terms of structure, expression, conservation and response to depletion of interacting proteins. Together, our data reveal a new group of RNAs that are difficult to detect using standard gene prediction and RNA sequencing techniques, suggesting that reliance on current gene annotation and sequencing techniques distorts the perceived architecture of the human transcriptome.

Highlights

  • Gene annotation is the blueprint of the human genome upon which gene expression analyses are performed [1,2]

  • To identify non-annotated non-coding RNA (ncRNA) genes, ribodepleted non-fragmented RNA extracted from the ovarian cancer cell line SKOV3IP1 was sequenced using thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) (Figure 1A), which enables full-length sequencing of RNAs between 20 and 500 nucleotides in length [16]

  • RNAcentral entries obtained from sources that do not provide genomic coordinates, such as ENA [46], or that are present only in a single specialized database with little experimental validation, such as snoRNA Atlas [47] and piRNABank [48], were considered previously non-annotated
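
The RNAcentral rule in the last highlight can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name, the set names, and the way an entry with mixed sources is resolved (sources without coordinates are ignored before the single-database check) are hypothetical and are not taken from the study. Only the database names come from the text.

```python
# Minimal sketch of the "previously non-annotated" rule for RNAcentral entries.
# All identifiers here are hypothetical; only the database names are from the text.

# Sources named in the text as not providing genomic coordinates.
NO_COORDINATE_SOURCES = {"ENA"}

# Specialized databases named in the text as having little experimental validation.
LOW_VALIDATION_SOURCES = {"snoRNA Atlas", "piRNABank"}


def considered_nonannotated(sources: set[str]) -> bool:
    """Return True if an RNAcentral entry would still count as non-annotated.

    `sources` is the set of databases that contributed the entry.
    Assumption: sources lacking genomic coordinates are ignored first, and the
    entry counts as non-annotated if nothing remains or if the single remaining
    source is a low-validation specialized database.
    """
    informative = sources - NO_COORDINATE_SOURCES
    if not informative:
        # Only coordinate-free sources (or no source at all) support the entry.
        return True
    if len(informative) == 1 and informative <= LOW_VALIDATION_SOURCES:
        # Present in a single specialized, weakly validated database.
        return True
    return False
```

For example, an entry supported only by ENA or only by snoRNA Atlas would be treated as previously non-annotated, whereas an entry that also appears in a general annotation set would not.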


Summary

Introduction

Gene annotation is the blueprint of the human genome upon which gene expression analyses are performed [1,2]. The overall number of protein-coding genes in the human genome stands at ∼20 000, with little change in their annotation in the past ten years [3,6]. Standard RNA sequencing pipelines use general gene annotation sets from databases like GENCODE [4] and RefSeq [5] to assign sequencing reads to specific genes, enabling the estimation of their expression. As a result, these pipelines discard aligned reads that map to non-annotated genes. For this reason, having a complete annotation set is essential to correctly evaluate the transcriptomic landscape with RNA-Seq data.
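
A toy sketch may help show why annotation-dependent counting loses reads. It is not the pipeline used in the study or by any particular tool: the `Interval` type, the `assign_reads` function, and the simple any-overlap rule are hypothetical, and strand and multi-gene overlaps are ignored for brevity.

```python
from collections import Counter
from typing import NamedTuple


class Interval(NamedTuple):
    """Genomic interval with 0-based, half-open coordinates (hypothetical type)."""
    chrom: str
    start: int
    end: int
    name: str = ""


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def assign_reads(reads: list[Interval], genes: list[Interval]):
    """Count reads per annotated gene and collect reads that hit no annotation.

    Assumption: a read is assigned to the first annotated gene it overlaps.
    Reads overlapping no gene are the ones a standard pipeline would discard.
    """
    counts: Counter = Counter()
    unassigned: list[Interval] = []
    for read in reads:
        hit = next((g.name for g in genes if overlaps(read, g)), None)
        if hit is None:
            unassigned.append(read)
        else:
            counts[hit] += 1
    return counts, unassigned


# Toy data: one annotated gene and three aligned reads.
genes = [Interval("chr1", 100, 200, "GENE_A")]
reads = [
    Interval("chr1", 120, 150),  # inside GENE_A
    Interval("chr1", 500, 560),  # aligned, but no annotation at this locus
    Interval("chr2", 120, 150),  # same coordinates, different chromosome
]
counts, unassigned = assign_reads(reads, genes)
```

In this toy case only one of the three aligned reads contributes to an expression estimate. An annotation-independent approach would instead start from the `unassigned` reads and look for loci with consistent coverage.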

