Abstract

Ab initio assembly of transcriptome sequencing data has been widely used to identify large intergenic non-coding RNAs (lincRNAs), a novel class of gene regulators involved in many biological processes. To differentiate real lincRNA transcripts from thousands of assembly artifacts, a series of filters, such as filters on transcript length, expression level and coding potential, needs to be applied. However, an easy-to-use, publicly available bioinformatics pipeline that integrates these filters does not yet exist. Hence, we implemented sebnif, an integrative bioinformatics pipeline to facilitate the discovery of bona fide novel lincRNAs that are suitable for further functional characterization. Specifically, sebnif is the only pipeline that implements an algorithm for identifying high-quality single-exonic lincRNAs, which are omitted in many studies. To demonstrate the usage of sebnif, we applied it to a real biological RNA-seq dataset from Human Skeletal Muscle Cells (HSkMC) and built a novel lincRNA catalog containing 917 highly reliable lincRNAs. Sebnif is available at http://sunlab.lihs.cuhk.edu.hk/sebnif/.

Highlights

  • Recent advances in transcriptome sequencing have led to the identification of many lincRNA transcripts (>200 nucleotides) [1,2,3] that localize in the intergenic region of protein coding genes

  • These transcripts have very weak or no coding potential for any protein products; their expression levels are generally lower than those of mRNAs, and they are often mistakenly considered transcriptional noise; many of them are transcribed by Polymerase II (Pol II) and spliced like mRNAs, while a significant portion of them remain single-exonic transcripts [3,4]

  • Sebnif can use the output of Cufflinks directly because it is written in General Feature Format (GFF)/Gene Transfer Format (GTF); for Scripture, which outputs files in Browser Extensible Data (BED) format, sebnif implements a utility program to convert them to GFF/GTF
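
The BED-to-GTF conversion mentioned in the last highlight can be illustrated with a short sketch. This is not sebnif's own utility: the function name `bed12_to_gtf` is hypothetical, and the sketch assumes the assembler writes standard 12-column BED records with one block per exon. The key detail is the coordinate shift, since BED uses 0-based, end-exclusive intervals while GTF uses 1-based, end-inclusive ones.

```python
def bed12_to_gtf(bed_line, source="Scripture"):
    """Convert one 12-column BED record into GTF exon lines.

    Hypothetical helper for illustration; assumes standard BED12 input.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, chrom_start, name, strand = fields[0], int(fields[1]), fields[3], fields[5]
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]

    # Assumption: the BED name column is reused as both gene and transcript ID.
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    gtf_lines = []
    for size, offset in zip(block_sizes, block_starts):
        start = chrom_start + offset + 1   # BED start is 0-based, GTF is 1-based
        end = chrom_start + offset + size  # BED end is exclusive, GTF end is inclusive
        gtf_lines.append("\t".join(
            [chrom, source, "exon", str(start), str(end), ".", strand, ".", attributes]
        ))
    return gtf_lines


if __name__ == "__main__":
    record = "chr1\t1000\t2000\ttx1\t0\t+\t1000\t2000\t0\t2\t100,200,\t0,800,"
    for line in bed12_to_gtf(record):
        print(line)
```

A single-exonic transcript is simply a record whose block count is 1, so it yields one exon line.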

Summary

Introduction

Recent advances in transcriptome sequencing have led to the identification of many lincRNA transcripts (>200 nucleotides) [1,2,3] that localize in the intergenic region of protein coding genes (mRNAs). A widely used approach is to apply several filters, such as filters on transcript length, expression level and coding potential, to remove assembly artifacts step by step [1,7,8]. This multi-filtering approach has proven effective in discovering thousands of novel multi-exonic lincRNAs in various systems [1,7,8,9]. However, a bioinformatics pipeline that integrates these filtering steps is not yet publicly available. To fill these gaps, we designed and implemented an integrative bioinformatics pipeline named sebnif (Self-Estimation Based Novel LincRNA Filtering pipeline) to facilitate the identification of both multi- and single-exonic lincRNAs. To illustrate its usage and performance, we applied it to an RNA-seq dataset from Human Skeletal Muscle Cells (HSkMC) to build a lincRNA catalog. Further analysis of these novel lincRNAs reveals their specific genomic distribution pattern and potential functions.
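
The step-by-step filtering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not sebnif's implementation: the `Transcript` record, the `filter_candidates` function, and the expression and coding-score cutoffs are hypothetical placeholders. Only the 200-nucleotide length threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    length: int          # transcript length in nucleotides
    expression: float    # abundance estimate from the assembler (assumed, e.g. FPKM)
    coding_score: float  # score from a coding-potential tool; higher = more coding-like


MIN_LENGTH = 200        # from the text: lincRNAs are longer than 200 nucleotides
MIN_EXPRESSION = 1.0    # assumed placeholder cutoff
MAX_CODING_SCORE = 0.0  # assumed placeholder cutoff


def filter_candidates(transcripts,
                      min_length=MIN_LENGTH,
                      min_expression=MIN_EXPRESSION,
                      max_coding_score=MAX_CODING_SCORE):
    """Apply the length, expression and coding-potential filters in sequence."""
    kept = [t for t in transcripts if t.length > min_length]
    kept = [t for t in kept if t.expression >= min_expression]
    kept = [t for t in kept if t.coding_score < max_coding_score]
    return kept


if __name__ == "__main__":
    assembled = [
        Transcript("too_short", 150, 5.0, -1.0),
        Transcript("low_expression", 500, 0.1, -1.0),
        Transcript("coding_like", 500, 5.0, 0.8),
        Transcript("candidate", 500, 5.0, -0.5),
    ]
    for t in filter_candidates(assembled):
        print(t.name)
```

Applying the filters one after another, rather than as a single combined test, makes it easy to report how many transcripts each step removes.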
