Abstract

Background: The most widely used human genome reference assembly, hg19, harbors minor alleles at 2.18 million positions, as revealed by the 1000 Genomes Phase 3 dataset. Although this is less than 2% of the 89 million variants reported, it has been shown that the minor alleles can result in 30% false positives in individual genomes, thus misleading and burdening downstream interpretation. More alarming is the fact that a significant percentage of variants that are homozygous recessive for these minor alleles, with potential disease implications, are masked from reporting.

Results: We have demonstrated that the false positives (FP) and false negatives (FN) can be corrected simply by replacing the nucleotides at the minor allele positions in hg19 with the corresponding major alleles. Here, we have replaced 2.18 million minor alleles, comprising Single Nucleotide Polymorphisms (SNPs), Insertions and Deletions (INDELs), and Multiple Nucleotide Polymorphisms (MNPs), in hg19 with the corresponding major alleles to create an ethnically normalized reference genome called hg19KIndel. In doing so, hg19KIndel has both corrected for sequencing errors acknowledged to be present in hg19 and improved read alignment near the minor alleles in hg19.

Conclusion: We have created and made available a new version of the human reference genome called hg19KIndel. Variant calling using hg19KIndel significantly reduces false positive calls, which in turn reduces the burden on downstream analysis and validation. It also reduces false negative calls: variants that were being missed due to the presence of minor alleles in hg19 are now called when using hg19KIndel. Using hg19KIndel also yields a better mapping percentage than the currently available human reference genome. The hg19KIndel reference genome and its auxiliary datasets are available at https://doi.org/10.5281/zenodo.2638113
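
The replacement step described under Results can be illustrated with a small sketch. It is not the authors' pipeline: the function name `apply_major_alleles`, the tuple layout, and the toy sequence are assumptions made for illustration. The sketch assumes VCF-style 1-based positions, non-overlapping variants, and that each listed allele pair is already known to be (reference allele, major allele).

```python
from typing import List, Tuple

# Hypothetical record layout: (1-based position, allele in the reference, major allele).
Variant = Tuple[int, str, str]


def apply_major_alleles(reference: str, variants: List[Variant]) -> str:
    """Return a copy of `reference` with each listed allele swapped for the major allele.

    Handles SNPs, MNPs, insertions and deletions, provided the variants do not overlap.
    """
    seq = reference
    # Apply edits from the highest position to the lowest, so that indels
    # never shift the coordinates of variants that are still to be applied.
    for pos, ref_allele, major_allele in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1                      # convert to a 0-based index
        end = start + len(ref_allele)
        if seq[start:end] != ref_allele:
            # Guard against coordinate or strand mix-ups before editing.
            raise ValueError(f"reference mismatch at position {pos}")
        seq = seq[:start] + major_allele + seq[end:]
    return seq


toy_reference = "ACGTACGTAC"
toy_variants = [
    (2, "C", "T"),      # SNP
    (4, "TA", "T"),     # one-base deletion
    (8, "T", "TGG"),    # two-base insertion
]
normalized = apply_major_alleles(toy_reference, toy_variants)
print(normalized)
```

Repeated string slicing is adequate for a toy sequence; a chromosome-scale implementation would more likely build the output in a single pass and would also need to record the coordinate shifts introduced by indels.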

Highlights

  • The most widely used human genome reference assembly, hg19, harbors minor alleles at 2.18 million positions, as revealed by the 1000 Genomes Phase 3 dataset

  • Creation of an ethnically normalized genome, hg19KIndel: per the 1000 Genomes Phase 3 dataset (Phase-3), there are around 81.3 million Single Nucleotide Polymorphisms (SNPs), 3.29 million Insertions and Deletions (INDELs), and ~60 thousand other variants, including Multiple Nucleotide Polymorphisms (MNPs) and structural variants, when compared to hg19

  • At 2.18 million of these positions, hg19 harbors the minor allele relative to the ethnically diverse individuals included in the Phase-3 dataset
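
One way to picture how such positions could be identified is sketched below. The exact criterion used by the authors is not stated here; this sketch assumes biallelic sites and treats the reference allele as minor when the alternate allele frequency exceeds 0.5. The `SiteRecord` type, its field names, and the toy records are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class SiteRecord(NamedTuple):
    """Hypothetical, simplified view of one biallelic variant site."""
    chrom: str
    pos: int          # 1-based position
    ref: str          # allele present in the reference assembly
    alt: str          # alternate allele
    alt_freq: float   # population frequency of the alternate allele


def sites_where_reference_is_minor(
    records: Iterable[SiteRecord], threshold: float = 0.5
) -> List[SiteRecord]:
    """Keep sites whose alternate allele is more common than the reference allele.

    Assumption: for a biallelic site, alt_freq > 0.5 means the reference
    carries the minor allele.
    """
    return [r for r in records if r.alt_freq > threshold]


toy_records = [
    SiteRecord("chr1", 100, "A", "G", 0.92),    # reference holds the minor allele
    SiteRecord("chr1", 250, "C", "T", 0.03),    # reference holds the major allele
    SiteRecord("chr2", 75, "GA", "G", 0.61),    # deletion is the major allele
]
minor_sites = sites_where_reference_is_minor(toy_records)
print([(r.chrom, r.pos) for r in minor_sites])
```

Multiallelic sites would need a different rule, for example comparing the reference allele frequency against the most frequent alternate allele.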


Summary

Introduction

The most widely used human genome reference assembly, hg19, harbors minor alleles at 2.18 million positions, as revealed by the 1000 Genomes Phase 3 dataset. One of the major goals of the effort behind this reference, popularly known as the Human Genome Project, was to decipher the human proteome, thereby allowing the cataloging of all potential drug targets. This was the first effort to provide a complete and accurate order of the 3 billion DNA base pairs that make up the human genome [1, 2]. The hg19 assembly is currently widely used as a reference genome in the search for mutations that cause, or predispose one to, various diseases, kick-starting an era of personalized genomics or consumer genetics.
