Abstract

Standard methods of DNA sequence analysis assume that sequences evolve independently, yet this assumption may not be appropriate for segmental duplications that exchange variants via interlocus gene conversion (IGC). Here, we use high-quality multiple sequence alignments from well-annotated segmental duplications to systematically identify IGC signals in the human reference genome. Our analysis combines two complementary methods: (i) a paralog quartet method that uses DNA sequence simulations to identify a statistical excess of sites consistent with inter-paralog exchange, and (ii) the alignment-based method implemented in the GENECONV program. One-quarter (25.4%) of the paralog families in our analysis harbor clear IGC signals by the quartet approach. Using GENECONV, we identify 1477 gene conversion tracks that cumulatively span 1.54 Mb of the genome. Our analyses confirm the previously reported high rates of IGC in subtelomeric regions and Y-chromosome palindromes, and identify multiple novel IGC hotspots, including the pregnancy-specific glycoproteins and the neuroblastoma breakpoint gene families. Although the duplication history of a paralog family is described by a single tree, we show that IGC has introduced incredible site-to-site variation in the evolutionary relationships among paralogs in the human genome. Our findings indicate that IGC has left significant footprints in patterns of sequence diversity across segmental duplications in the human genome, outpacing the contributions of single-base mutation by orders of magnitude. Collectively, the IGC signals we report comprise a catalog that will provide a critical reference for interpreting observed patterns of DNA sequence variation across duplicated genomic regions, including targets of recent adaptive evolution in humans.

Highlights

  • The ancestral lineage leading to humans and great apes experienced a surge in the rate of genomic duplication and deletion [1,2,3,4,5]

  • If paralogs within a family have a history of interlocus recombination, the corresponding alignment may contain an excess of R sites supporting an alternative unrooted tree topology

  • By expanding our analysis to all paralogous sequences >10 kb in the current reference assembly with ≥4 copies, our study presents the most comprehensive analysis of interlocus gene conversion (IGC) in the human genome to date, surveying a total of 38.9 Mb of duplicated genomic sequence



Introduction

The ancestral lineage leading to humans and great apes experienced a surge in the rate of genomic duplication and deletion [1,2,3,4,5]. This sudden burst of large-scale rearrangement contributed to an exceptional human genomic architecture characterized by ~166 Mb of paralogous sequences with >90% sequence identity, including over 95 Mb of sequences with >98% identity [1]. One mechanism of non-allelic homologous recombination (NAHR) can generate large deletions, duplications, and translocations. These rearrangements impose an enormous disease burden on human populations, and contribute to species-specific genome structural innovations [6,7,8,9]. The genetic signature of an IGC event is subtle, and escapes detection by standard methods for assaying structural variation, such as array comparative genome hybridization.
