Abstract

While most transcriptome analyses in high-throughput clinical studies focus on gene-level expression, alternative isoforms of gene transcripts are a major source of the functional diversity of the human genome. It is, therefore, essential to annotate isoforms of gene transcripts for genome-wide transcriptome studies. Recently developed mRNA sequencing technology presents an unprecedented opportunity to discover new forms of transcripts, and at the same time brings bioinformatic challenges due to its short read length and incomplete coverage of the transcripts. In this work, we proposed a computational approach to reconstruct new mRNA transcripts from short sequencing reads using reference information on known transcripts in existing databases. The prior knowledge helped to define exon boundaries and fill in the transcript regions not covered by sequencing data. This approach was demonstrated using a deep sequencing data set of human muscle tissue with transcript annotations in RefSeq as prior knowledge. We identified 2,973 junctions, 7,471 exons, and 7,571 transcripts not previously annotated in RefSeq. Of these new transcripts, 73% were supported by UCSC Known Genes, Ensembl or EST transcript annotations. In addition, the reconstructed transcripts were much longer than those from de novo approaches that assume no prior knowledge. These previously unannotated transcripts can be integrated with known transcript annotations to improve both the design of microarrays and the follow-up analyses of isoform expression. The overall results demonstrated that incorporating transcript annotations from genomic databases significantly helps the reconstruction of novel transcripts from short sequencing reads for transcriptome research.

Highlights

  • In large-scale clinical studies, most existing data sets focus on gene expression profiles, yet the human transcriptome is undoubtedly more complex

  • Coverage of the RNA sequencing (RNA-Seq) data: 203 million RNA-Seq reads from human muscle tissue were mapped over annotated exon and junction regions collected from the RefSeq, Ensembl, UCSC Known Genes and expressed sequence tag (EST) databases, and 120 million reads were uniquely mapped when allowing up to 2 mismatches

  • We proposed a knowledge-based approach to reconstruct new mRNA transcripts from short sequencing reads
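
One small step behind the knowledge-based idea, deciding whether a splice junction seen in the reads is already annotated, can be illustrated with a minimal sketch. This is not the authors' implementation. The `Junction` type, the `split_junctions` helper, and all coordinates below are hypothetical and chosen only for illustration, assuming junctions are compared by exact chromosome, strand, and boundary positions.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Junction(NamedTuple):
    """Hypothetical splice-junction record (illustrative only)."""
    chrom: str
    donor: int      # last base of the upstream exon
    acceptor: int   # first base of the downstream exon
    strand: str


def split_junctions(
    observed: Iterable[Junction],
    annotated: Iterable[Junction],
) -> Tuple[List[Junction], List[Junction]]:
    """Partition observed junctions into (known, novel) by exact match
    against the reference annotation."""
    annotated_set = set(annotated)
    known, novel = [], []
    for junction in observed:
        if junction in annotated_set:
            known.append(junction)
        else:
            novel.append(junction)
    return known, novel


# Toy data (invented coordinates, not taken from the study).
annotated = [
    Junction("chr1", 1000, 2000, "+"),
    Junction("chr1", 2100, 3000, "+"),
]
observed = [
    Junction("chr1", 1000, 2000, "+"),  # matches the annotation
    Junction("chr1", 1000, 3000, "+"),  # exon-skipping junction, not annotated
]

known, novel = split_junctions(observed, annotated)
print(len(known), "known;", len(novel), "novel")
```

Exact matching is the simplest possible criterion. A real pipeline would also need to handle alignment ambiguity near splice sites and require a minimum number of supporting reads, neither of which is modeled here.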


Summary

Introduction

In large-scale clinical studies, most existing data sets focus on gene expression profiles, yet the human transcriptome is undoubtedly more complex. More than 90% of genes have been shown to undergo alternative splicing [3,4], and many disease-causing mutations introduce alternative mRNA transcripts [5]. It is, therefore, of great importance to effectively measure the levels of gene isoforms in human health and disease. An emerging approach for large-scale clinical studies is to first sequence at sufficient depth to comprehensively identify the mRNA transcriptome of the disease, then design customized microarrays targeting these transcripts, and finally screen thousands of patient samples at high throughput using the arrays [6].

