Abstract

Background: Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variation involving NRS exist.

Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify NRS and their locations. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats.

Conclusions: Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggest that the origin of alternate alleles is associated with tandem repeats. Our results enrich the spectrum of known genetic variation in the human genome.
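The 90% identity criterion in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' pipeline: the function names `percent_identity` and `classify_nrs` are hypothetical, identity is taken as matching columns divided by alignment length (other definitions exist), and an insertion is assumed to be an NRS that replaces no reference bases.

```python
IDENTITY_CUTOFF = 0.90  # threshold stated in the abstract


def percent_identity(aligned_a, aligned_b):
    """Fraction of alignment columns where both gapped sequences carry the same base.

    Assumption: identity = matching columns / total columns, gaps ('-') never match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / len(aligned_a)


def classify_nrs(replaced_ref_len, identity):
    """Hypothetical classifier for one non-reference sequence.

    replaced_ref_len: reference bases replaced by the NRS (0 is assumed to mean insertion).
    identity: identity to the replaced reference allele, or None if it does not align.
    """
    if replaced_ref_len == 0:
        return "insertion"
    if identity is None or identity < IDENTITY_CUTOFF:
        return "alternate allele"
    return "reference-like"


# Toy alignment: 5 matching columns out of 6, which falls below the 90% cutoff.
toy_identity = percent_identity("ACGT-A", "ACGTTA")
toy_label = classify_nrs(6, toy_identity)
```

In this toy case one gap in six columns is enough to push the pair below the cutoff, so the sequence is labeled an alternate allele rather than a reference-like allele.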

Highlights

  • Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance

  • After removing redundant sequences and sequences shorter than 400 bp, we obtained the unaligned sequences from each assembly, which were merged into a unified non-reference call set of 15,055 sequences adding up to 129.1 Mb with a median length of 2848 bp (N50 = 1066 bp)

  • The majority of the alternate alleles that we found have not been previously reported, which could be due to several reasons: (1) Most previous studies were designed to focus on insertions, whereas other types of NRS were largely ignored [3, 5, 13]; (2) Many studies have relied mainly on short-read data to obtain NRS [3, 5], which is less efficient for discovering long structural variations than the assembly-versus-assembly approach applied in our study; and (3) Many alternate alleles were singletons, suggesting that they are either of very low frequency and therefore hard to detect, or false positives due to assembly errors
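The length filter and summary statistics mentioned in the highlights (400 bp cutoff, median length, N50) can be sketched as follows. This is a minimal illustration on made-up lengths, not the authors' code; the helper names `n50` and `summarize` are hypothetical, and N50 is taken in its usual sense as the length L such that sequences of length at least L contain half of the total bases.

```python
from statistics import median

MIN_LENGTH = 400  # bp; sequences shorter than this are dropped, per the text


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one length")


def summarize(lengths, min_length=MIN_LENGTH):
    """Apply the length cutoff, then report count, total bases, median and N50."""
    kept = [n for n in lengths if n >= min_length]
    return {
        "count": len(kept),
        "total_bp": sum(kept),
        "median_bp": median(kept),
        "n50_bp": n50(kept),
    }


# Toy lengths (hypothetical, not from the study); the 250 bp entry is filtered out.
toy_lengths = [250, 400, 500, 800, 1200, 3000, 9000]
stats = summarize(toy_lengths)
```

Because N50 weights each sequence by its length while the median weights each sequence equally, a few long sequences move N50 much more than they move the median.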



Introduction

Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. Despite the introduction of alternate alleles for a graph-based representation, the current reference genome is largely derived from a single individual of African-European origin [2], limiting its representation of diverse populations. Evidence accumulated in recent years has shown that individuals still carry sequences that are not represented in the reference genome. These sequences could be an important type of structural variation underlying disease associations or complex traits [3]. The discovery of NRS is a prerequisite for a more complete graph-based genome, thereby enabling improved genomic analyses and understanding of genomic architecture [4].
