Abstract

Background: A complete and accurate human reference genome is important for functional genomics research. Gaps in the reference genome and individual-specific sequences can therefore significantly affect a wide range of studies.

Results: We used two RNA-Seq datasets, from human brain tissues and from 10 mixed cell lines, to investigate the completeness of the human reference genome. First, we demonstrated that, of the previously identified ~5 Mb Asian and ~5 Mb African novel sequences that are absent from NCBI build 36, ~211 kb and ~201 kb, respectively, could be transcribed. Our results suggest that many of these transcribed regions are not specific to Asian and African individuals but are also present in Caucasians. We then found that 104 RefSeq genes that are unalignable to NCBI build 37 are expressed at more than 0.1 RPKM in brain and cell lines. Of these, 55 are conserved across human, chimpanzee and macaque, suggesting that a significant number of functional human genes are still absent from the human reference genome. Moreover, we identified hundreds of novel transcript contigs that cannot be aligned to NCBI build 37, RefSeq genes or EST sequences. Some of these novel transcript contigs are also conserved among human, chimpanzee and macaque. By positioning these contigs onto the human genome, we identified several large deletions in the reference genome. Several conserved novel transcript contigs were further validated by RT-PCR.

Conclusion: Our findings demonstrate that a significant number of genes are still absent from the incomplete human reference genome, highlighting the importance of further refining the reference genome and curating the missing genes. Our study also shows the importance of de novo transcriptome assembly. A transcriptome-based comparison between the reference genome and other human genomes provides an alternative way to refine the human reference genome.
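The 0.1 RPKM cutoff used above can be made concrete with a short calculation. The sketch below uses the standard definition of RPKM (reads per kilobase of transcript per million mapped reads); the function names, read counts and gene length are hypothetical illustrations, and only the 0.1 threshold comes from the abstract.

```python
RPKM_THRESHOLD = 0.1  # expression cutoff quoted in the abstract


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length in kb) / (mapped reads in millions)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def is_expressed(read_count: int, gene_length_bp: int, total_mapped_reads: int,
                 threshold: float = RPKM_THRESHOLD) -> bool:
    """True when the gene's RPKM is strictly above the threshold."""
    return rpkm(read_count, gene_length_bp, total_mapped_reads) > threshold


# Hypothetical gene: 10 reads on a 2,000 bp transcript, 20 million mapped reads.
example_value = rpkm(10, 2000, 20_000_000)
```

With these made-up numbers, a single read on the same transcript would fall below the cutoff, which shows how lenient a 0.1 RPKM threshold is at this sequencing depth.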

Highlights

  • A complete and accurate human reference genome is important for functional genomics research

  • Detecting transcribed regions in Asian and African novel sequences: we used two transcriptome sequencing datasets, generated with Illumina next-generation sequencing technology from two reference RNA samples established by the MicroArray Quality Control (MAQC) project [18], to carry out our study (Figure 1)

  • Using 90% identity and 98% coverage as thresholds, we found that only 991.5 kb (19.35%) of the Asian (YH) and 926.7 kb (19.31%) of the African (NA18507) novel sequences could be aligned to GRCh37; the rest remained unalignable
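The identity and coverage thresholds in the last highlight can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Alignment` record, its field names and the example values are hypothetical, and it assumes identity is matches divided by aligned length and coverage is aligned length divided by query length, which this excerpt does not define.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Alignment:
    """Hypothetical best-hit record for one novel sequence against the reference."""
    query_id: str
    query_length: int    # length of the novel sequence, in bp
    aligned_length: int  # query bases covered by the alignment
    matches: int         # identical bases within the aligned region


def passes(aln: Alignment, min_identity: float = 0.90,
           min_coverage: float = 0.98) -> bool:
    """Apply the 90% identity / 98% coverage thresholds (definitions assumed)."""
    if aln.aligned_length <= 0 or aln.query_length <= 0:
        return False
    identity = aln.matches / aln.aligned_length
    coverage = aln.aligned_length / aln.query_length
    return identity >= min_identity and coverage >= min_coverage


def alignable_bp(alignments: Iterable[Alignment]) -> int:
    """Total length of distinct query sequences with a passing alignment."""
    seen = {}
    for aln in alignments:
        if passes(aln):
            seen[aln.query_id] = aln.query_length
    return sum(seen.values())


# Made-up records: only seq1 satisfies both thresholds.
hits = [
    Alignment("seq1", 1000, 990, 950),  # identity ~0.96, coverage 0.99
    Alignment("seq2", 1000, 900, 890),  # coverage 0.90, too low
    Alignment("seq3", 500, 500, 440),   # identity 0.88, too low
]
total_alignable = alignable_bp(hits)
```

Dividing the alignable total by the summed length of all novel sequences gives the kind of percentage reported in the highlight.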

Introduction

A complete and accurate human reference genome is important for functional genomics research. Khaja et al. [11] and Kidd et al. [12] reported that a notable portion of human genomic sequence was absent from NCBI build 35 or build 36, suggesting that the updated human reference genome is still not completely assembled and annotated. Some reads that cannot be mapped to the reference could be generated from functional genes that are not present in the reference genome. Discarding such sequences can result in the loss of important information.
