Abstract

The protein-coding sequences of the human reference genome GRCh38 and of the RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other because of polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. By assigning alternative allele frequencies to the discordant bases, we observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt. Because some reference sequences carry minor alleles, functional and structural annotations may be based on alleles that are rare in the human population, which can bias these analyses. Some of the differences between RefSeq and GRCh38 reflect biological differences at known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon–intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify the overall consistency and the remaining inconsistencies between the reference sequences.
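The major-allele comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `DiscordantBase` record, its field names, and the example values are all hypothetical, and the sketch assumes that the population dataset reports the alternative allele frequency relative to the GRCh38 allele.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscordantBase:
    """Hypothetical record for one base where GRCh38 and RefSeq disagree."""
    chrom: str
    pos: int              # 1-based GRCh38 coordinate
    grch38_allele: str    # allele in the reference genome
    refseq_allele: str    # allele in the RefSeq mRNA, on the genomic strand
    alt_allele: str       # alternative allele in the population dataset
    alt_freq: float       # alternative allele frequency, 0.0 to 1.0


def major_allele_source(site: DiscordantBase) -> str:
    """Name the reference that carries the more frequent of the two alleles.

    Assumption: only the GRCh38 allele and one alternative allele are
    considered, so the GRCh38 allele frequency is 1 - alt_freq.
    """
    if site.alt_freq == 0.5:
        return "tie"
    major = site.alt_allele if site.alt_freq > 0.5 else site.grch38_allele
    if major == site.refseq_allele:
        return "RefSeq"
    if major == site.grch38_allele:
        return "GRCh38"
    # The major allele matches neither reference, e.g. at a triallelic site.
    return "neither"


def tally_major_alleles(sites: list[DiscordantBase]) -> Counter:
    """Count how often each reference carries the major allele."""
    return Counter(major_allele_source(site) for site in sites)


# Made-up example sites, for illustration only.
example_sites = [
    DiscordantBase("chr1", 100, "A", "G", "G", 0.93),
    DiscordantBase("chr2", 200, "C", "T", "T", 0.02),
    DiscordantBase("chr3", 300, "A", "C", "G", 0.80),
]
example_tally = tally_major_alleles(example_sites)
```

With real data, the records would be built from the discordant-base list and a population variant resource; sites where the RefSeq allele is neither the GRCh38 allele nor the listed alternative allele fall into the `neither` class and need separate handling.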

Highlights

  • Accurate sequences of the human genome and genes are fundamental resources for functional genomics and translational medicine

  • The number of mRNAs classified for each class by each method is summarized in Supplementary Table S1, and the classification inconsistencies between University of California Santa Cruz (UCSC) and consensus CDS (CCDS) are discussed in the Supplementary Discussion

  • Three synonymous and four non-synonymous substitutions match triallelic sites reported by the 1000 Genomes Project (1K genomes), in which RefSeq represents one of the two alternative alleles. These results suggest that the majority of the differences between GRCh38 and RefSeq mRNA originate from natural variations in the human genome

Introduction

Accurate sequences of the human genome and genes are fundamental resources for functional genomics and translational medicine. Although many other groups have published alternative human genome sequences of separate individuals [4, 5], including those from different ethnicities [6, 7] and from the haploid genome of a hydatidiform mole [8], the reference genome is distinguished by its accuracy and its coverage of difficult regions, including repeats and segmental duplications [9]. These alternative genomes have played important roles in refining the reference genome, by identifying its erroneous regions [5, 8]. The latest reference genome, GRCh38, was published in 2013 and includes the results of many high-throughput sequencing efforts [17].
