Abstract

The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have an elevated GC-content, and are primarily located in centromeric or telomeric regions. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 at chromosomes 14, 17, and 21. Inclusion of NS into the GRCh38 reference radically improves the alignment and variant calling from short-read whole-genome sequencing data at several genomic loci. A re-analysis of a Swedish population-scale sequencing project yields > 75,000 putative novel single nucleotide variants (SNVs) and removes > 10,000 false positive SNV calls per individual, some of which are located in protein coding regions. Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data.

Highlights

  • Due to advances in DNA sequencing technologies, whole genome sequencing (WGS) has become an established method to study human genetic variation at a population scale

  • The current GRCh38 reference might not be optimal in the context of population-specific WGS projects, and more information could be gained from WGS data by instead using local reference genomes, tailored to a specific country or population

  • The two individuals were unrelated and selected from the 1000 samples included in SweGen, which is a project where the genetic variation in a cross-section of the Swedish population was studied using Illumina WGS [1]


Introduction

Due to advances in DNA sequencing technologies, whole genome sequencing (WGS) has become an established method to study human genetic variation at a population scale. The vast majority of human WGS is performed using short-read Illumina sequencing technology, and requires an alignment of the sequence reads to a human reference sequence. The gold standard reference is the GRCh38 release from 2013, which is based on DNA from multiple donors and intended to represent a pan-human genome, rather than a single individual or population group [9]. The de novo assembly of 150 Danish individuals based on Illumina mate-pair sequencing has strengthened the hypothesis that regional reference genomes can increase the power of association studies and improve precision medicine [10]. Since Illumina’s technology is limited by short read lengths and amplification biases [11], it is not a viable alternative for creating human de novo assemblies comparable to GRCh38 in terms of completeness and contiguity.

