Abstract

The human genome harbors numerous structural variants (SVs) which, due to their repetitive nature, are currently underexplored in short-read whole-genome sequencing approaches. Using single-molecule, real-time (SMRT) long-read sequencing technology in combination with FALCON-Unzip, we generated a de novo assembly of the diploid genome of a 115-year-old, cognitively healthy Dutch woman. We combined this assembly with two previously published haploid assemblies (CHM1 and CHM13) and the GRCh38 reference genome to create a compendium of SVs that occur across five independent human haplotypes, using the graph-based multi-genome aligner REVEAL. Across these five haplotypes, we detected 31,680 euchromatic SVs (>50 bp). Of these, ~62% consisted of repetitive sequences with ‘variable number tandem repeats’ (VNTRs) and ~10% were mobile elements (Alu, L1, and SVA), while the remaining variants were inversions and indels. We observed that VNTRs with GC-content >60% and repeat patterns longer than 15 bp were 21-fold enriched in the subtelomeric regions (within 5 Mb of the ends of chromosome arms). VNTR lengths can expand to exceed a critical length, which is associated with impaired gene transcription. The genes that contained the most VNTRs, of which PTPRN2 and DLGAP2 are the most prominent examples, were found to be predominantly expressed in the brain and associated with a wide variety of neurological disorders. Repeat-induced variation represents a sizeable fraction of the genetic variation in human genomes and should be included in investigations of genetic factors associated with phenotypic traits, specifically those associated with neurological disorders. We make available the long- and short-read sequence data of the supercentenarian genome, and a compendium of SVs as identified across five human haplotypes.
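The thresholds quoted in the abstract (GC-content >60%, repeat pattern longer than 15 bp, within 5 Mb of a chromosome end) can be illustrated with a small filter. This is only a sketch and not the authors' pipeline: the `Vntr` record, its field names, and the choice to measure GC-content on the full repeat sequence and to approximate "ends of chromosome arms" by the two ends of the chromosome are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract.
SUBTELOMERIC_WINDOW_BP = 5_000_000  # "within 5 Mb of the ends of chromosome arms"
MIN_GC_FRACTION = 0.60              # "GC-content >60%"
MIN_MOTIF_LENGTH_BP = 15            # "repeat patterns longer than 15 bp"


@dataclass
class Vntr:
    """Hypothetical record for one VNTR call (field names are assumptions)."""
    chrom_length: int  # length of the chromosome carrying the VNTR, in bp
    start: int         # 0-based start coordinate of the repeat tract
    end: int           # end coordinate of the repeat tract
    motif: str         # repeat unit ("repeat pattern")
    sequence: str      # full sequence of the repeat tract


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq; 0.0 for an empty string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def is_subtelomeric(v: Vntr) -> bool:
    """Assumption: a tract is subtelomeric if it lies within the window
    of either end of the chromosome."""
    return (v.start < SUBTELOMERIC_WINDOW_BP
            or v.end > v.chrom_length - SUBTELOMERIC_WINDOW_BP)


def is_gc_rich_long_motif(v: Vntr) -> bool:
    """True if the tract exceeds both the GC and motif-length thresholds."""
    return (gc_fraction(v.sequence) > MIN_GC_FRACTION
            and len(v.motif) > MIN_MOTIF_LENGTH_BP)


if __name__ == "__main__":
    motif = "GGGCCCGGGCCCGGGCAT"  # made-up 18 bp motif, not from the paper
    near_end = Vntr(100_000_000, 1_000_000, 1_000_180, motif, motif * 10)
    print(is_subtelomeric(near_end), is_gc_rich_long_motif(near_end))
```

Counting how many VNTRs pass both tests inside versus outside the window, normalised by the size of each region, is one way an enrichment ratio of this kind could be estimated; the abstract does not state how the 21-fold figure was computed.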

Highlights

  • Repetitive sequences give rise to a myriad of structural variants (SVs), and recent findings indicate that these might explain at least part of the missing heritability for many traits[1,2,3,4,5]

  • We found that the genes that contained the most ‘variable number tandem repeats’ (VNTRs) were enriched for genes expressed in the brain, genes with multiple splice isoforms and genes associated with autism spectrum disorders

  • The two genes that contain the most VNTRs in our analysis, DLGAP2 and PTPRN2, are predominantly expressed in the brain and were previously associated with a wide range of neurological phenotypes: rare copy-number variations (CNVs) in DLGAP2 were associated with the autism spectrum[50,51]; rare CNVs in PTPRN2 were associated with attention-deficit hyperactivity disorder[52,53]; GWAS markers in PTPRN2 were associated with schizophrenia/bipolar disorder[54]; rare single-nucleotide variations in DLGAP2 were associated with schizophrenia[55]; and linkage analysis of the PTPRN2 gene identified an association with cocaine dependence/depression[56]

Introduction

Repetitive sequences give rise to a myriad of structural variants (SVs), and recent findings indicate that these might explain at least part of the missing heritability for many traits[1,2,3,4,5]. A recent report indicated that expansion of the 25 nt repeat unit in the ABCA7 gene beyond ~5200 nt is associated with a ~4.5-fold increased risk for Alzheimer’s disease[5]. Large repetitive regions are difficult to assess because short sequencing reads of 100–150 bp do not span the entire structural variant[10]. Longer sequencing reads address this problem. Various studies have shown that PacBio’s single-molecule, real-time (SMRT) long-read sequencing can be used to reveal large numbers of novel SVs in previously inaccessible regions of the human genome[10,11,12,13,14].
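The arithmetic behind the read-length argument can be made explicit with a short sketch. The 25 nt unit, the ~5200 nt threshold and the 100–150 bp short-read length come from the text above; the function names, the 50 nt flanking anchor on each side, and the 10,000 nt long-read length are hypothetical values chosen only for illustration.

```python
# Values from the text.
ABCA7_UNIT_NT = 25          # length of one repeat unit
ABCA7_THRESHOLD_NT = 5200   # approximate tract length linked to increased risk
SHORT_READ_NT = 150         # upper end of the 100-150 bp short-read range


def copies_in_tract(tract_length_nt: int, unit_length_nt: int) -> float:
    """Number of repeat units that fit in a tract of the given length."""
    return tract_length_nt / unit_length_nt


def read_can_span(read_length_nt: int, tract_length_nt: int,
                  flank_nt: int = 50) -> bool:
    """True if a single read can cover the whole tract plus a unique
    anchor of flank_nt on each side (the flank size is an assumption)."""
    return read_length_nt >= tract_length_nt + 2 * flank_nt


if __name__ == "__main__":
    print(copies_in_tract(ABCA7_THRESHOLD_NT, ABCA7_UNIT_NT))
    print(read_can_span(SHORT_READ_NT, ABCA7_THRESHOLD_NT))
    # 10,000 nt is a hypothetical long-read length, not a figure from the paper.
    print(read_can_span(10_000, ABCA7_THRESHOLD_NT))
```

Under these assumptions the threshold corresponds to about 208 repeat units, far more than a 150 bp read can cover, which is why the length of such a tract cannot be read off from any single short read.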
