A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification.

Shanrong Zhao, Baohong Zhang

doi:10.1186/s12864-015-1308-8

Abstract

Background: RNA-Seq has become increasingly popular in transcriptome profiling. One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons. Acquiring a transcriptome expression profile requires genomic elements to be defined in the context of the genome. Multiple human genome annotation databases exist, including RefGene (RefSeq Gene), Ensembl, and the UCSC annotation database. The impact of the choice of an annotation on estimating gene expression remains insufficiently investigated.

Results: In this paper, we systematically characterized the impact of genome annotation choice on read mapping and transcriptome quantification by analyzing an RNA-Seq dataset generated by the Human Body Map 2.0 Project. The impact of a gene model on the mapping of non-junction reads differs from its impact on junction reads. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene model was used. By contrast, this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10–15% mapped alternatively. There are 21,958 common genes among the RefGene, Ensembl, and UCSC annotations. When we compared the gene quantification results in the RefGene and Ensembl annotations, 20% of genes were not expressed, and thus had a zero count in both annotations. Surprisingly, identical gene quantification results were obtained for only 16.3% (about one sixth) of genes. Approximately 28.1% of genes’ expression levels differed by 5% or more, and of those, the relative expression levels for 9.3% of genes (2,038 genes) differed by 50% or more. The case studies revealed that differences in gene definitions among gene models frequently result in inconsistent gene quantification.

Conclusions: We demonstrated that the choice of a gene model has a dramatic effect on both gene quantification and differential analysis. Our research will help RNA-Seq data analysts to make an informed choice of gene model in practical RNA-Seq data analysis.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1308-8) contains supplementary material, which is available to authorized users.
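The comparison of gene quantification between two annotations can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not define how the relative difference was computed, so the formula below (absolute difference divided by the larger count) is an assumption, and the function names, gene names, and toy counts are hypothetical.

```python
from collections import Counter


def relative_difference(count_a, count_b):
    """Relative difference between two counts, scaled by the larger count.

    ASSUMPTION: this definition is illustrative; the abstract does not
    state the exact formula used in the paper.
    """
    larger = max(count_a, count_b)
    if larger == 0:
        return 0.0
    return abs(count_a - count_b) / larger


def classify_gene(count_a, count_b):
    """Place a gene into one of the categories discussed in the abstract."""
    if count_a == 0 and count_b == 0:
        return "not expressed"
    diff = relative_difference(count_a, count_b)
    if diff == 0:
        return "identical"
    if diff < 0.05:
        return "differs by < 5%"
    if diff < 0.50:
        return "differs by 5% to < 50%"
    return "differs by >= 50%"


def summarize(counts_a, counts_b):
    """Fraction of common genes falling into each category."""
    common = sorted(set(counts_a) & set(counts_b))
    if not common:
        return {}
    tally = Counter(classify_gene(counts_a[g], counts_b[g]) for g in common)
    return {category: n / len(common) for category, n in tally.items()}


# Hypothetical toy counts; real input would be per-gene read counts
# obtained with each annotation.
refgene_counts = {"GENE1": 0, "GENE2": 100, "GENE3": 100,
                  "GENE4": 100, "GENE5": 40, "ONLY_REFGENE": 7}
ensembl_counts = {"GENE1": 0, "GENE2": 100, "GENE3": 98,
                  "GENE4": 80, "GENE5": 100, "ONLY_ENSEMBL": 3}

fractions = summarize(refgene_counts, ensembl_counts)
for category, fraction in sorted(fractions.items()):
    print(f"{category}: {fraction:.1%}")
```

Only genes present in both annotations are compared, mirroring the abstract's restriction to common genes. Genes with a zero count in both annotations are reported separately so that they do not inflate the "identical" category.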

Highlights

One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons.
RNA-Seq has become increasingly popular in transcriptome profiling.
We assessed the impact of gene models on the mapping of junction and non-junction reads, and compared the impact of genome annotation choice on gene quantification and differential analysis.

Summary

Introduction

One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons. RNA-Seq, the sequencing of a population of RNA transcripts using high-throughput sequencing technologies, profiles an entire transcriptome at single-base resolution whilst concurrently quantifying gene expression levels [1,2,3,4,5]. Accurate alignment of high-throughput short RNA-Seq reads remains challenging, mainly because of junction (i.e., exon-exon spanning) reads and the ambiguity of multiple-mapping reads. In our previous study [20], we assessed the impact of using RefGene (RefSeq Gene) [21] on mapping short RNA-Seq reads, and demonstrated that without the assistance of RefGene, more than one third of junction reads failed to map to the reference genome in the alignment process.
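To make the distinction between junction and non-junction reads concrete, the sketch below inspects the CIGAR string of an aligned read. In the SAM format, the `N` operation marks a region skipped on the reference, which for RNA-to-genome alignments corresponds to an intron. This text does not say how the authors identified junction reads, so treating "CIGAR contains `N`" as the criterion is an assumption, and the function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference genome.
REFERENCE_CONSUMING = {"M", "D", "N", "=", "X"}


def intron_spans(pos, cigar):
    """Return (start, end) reference intervals skipped by N operations.

    `pos` is the 1-based leftmost mapping position from the SAM record;
    the returned intervals are 1-based and inclusive.
    """
    spans = []
    ref_pos = pos
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "N":
            spans.append((ref_pos, ref_pos + length - 1))
        if op in REFERENCE_CONSUMING:
            ref_pos += length
    return spans


def is_junction_read(cigar):
    """ASSUMED criterion: a read is a junction read if its CIGAR has an N."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


# A 75 bp read split across two exons by a 500 bp intron.
spliced = intron_spans(100, "30M500N45M")
# A 75 bp read aligned contiguously within one exon.
unspliced = intron_spans(100, "75M")
```

An unmapped read, whose CIGAR field is `*`, yields no operations and is therefore reported as a non-junction read by `is_junction_read`; a real analysis would filter unmapped reads first.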

Journal: BMC Genomics
Publication Date: Feb 18, 2015
Citations: 147
License type: CC BY 4.0

Similar Papers

Impact of Gene Annotation on RNA-seq Data Analysis
Shanrong Zhao ... Baohong Zhang
14 Jan 2016

Impact of gene annotation choice on the quantification of RNA-seq data
David Chisanga ... Wei Shi
BMC Bioinformatics | VOL. 23
30 Mar 2022

High-throughput long noncoding RNA profiling for diagnostic and prognostic markers in cancer: opportunities and challenges.
Zhifu Sun
Epigenomics | VOL. 7
01 Oct 2015

Long and Repeat-Rich Intronic Sequences Favor Circular RNA Formation under Conditions of Reduced Spliceosome Activity.
Mantian Wang ... Erin M Schuman
iScience | VOL. 20
06 Sep 2019
