Abstract

Polymorphic Tandem Repeats (PTR) are a common form of polymorphism in the human genome. A PTR is a variation, found in an individual or in a population, in the number of repeating units at a Tandem Repeat (TR) locus relative to the reference genome. Several phenotypic traits and diseases have been found to be strongly associated with, or caused by, specific PTR loci. PTR are divided into two main classes: Short Tandem Repeats (STR), whose repeating unit is at most 6 base pairs long, and Variable Number Tandem Repeats (VNTR), whose repeating unit is longer than 6 base pairs. As ever larger populations are screened in high-throughput sequencing projects, it becomes technically feasible and desirable to explore the association between PTR and a wide range of such traits and conditions. To facilitate these studies, we have devised a method for compiling catalogs of PTR from assembled genomes, and we have produced a catalog of PTR for genic regions (exons, introns, UTR and adjacent regions) of the human genome (GRCh38). In the first phase, we applied four different TR discovery software tools and uncovered 55,223,485 TR (after duplicate removal) in GRCh38; in the second phase, 373,173 of these were determined to be PTR by comparison with five assembled human genomes. Of these, 263,266 are not included in state-of-the-art PTR catalogs. The new methodology is based mainly on a hierarchical and systematic application of alignment-based sequence comparisons to identify and measure the polymorphism of TR. While previous catalogs focus on STR of small total size, we remove any size restriction, targeting the more general class of PTR, and we also target fuzzy TR by using specific detection tools. Like previous catalogs of human polymorphic loci, we orient our catalog toward applications in the discovery of disease-associated loci.
Validation by cross-referencing with existing catalogs on common clinically-relevant loci shows good concordance. Overall, this proposed census of human PTR in genic regions is a shared resource (web accessible), complementary to existing catalogs, facilitating future genome-wide studies involving PTR.

Highlights

  • Tandem repeats (TR) in DNA sequences are patterns of similar subsequences directly adjacent to each other

  • High-throughput sequencing technologies, which are improving steadily, are becoming instrumental in accurately measuring Polymorphic Tandem Repeats (PTR) in individuals and populations

  • Our approach takes human genome assemblies as input and can be considered a natural extension of the approach proposed by Payseur et al (2011), which uses only one Tandem Repeat (TR) detection tool on the reference genome and one additional assembled genome to measure TR expansion/contraction, after positioning of the flanking regions

Introduction

Tandem repeats (TR) in DNA sequences are patterns of similar subsequences directly adjacent to each other. TR with a repeat unit of 1 to 10 Kb are termed Tandem Copy Number Variations (TCNV) (He et al, 2011). Microsatellites are termed Short Tandem Repeats (STR), while minisatellites are termed Variable Number Tandem Repeats (VNTR) when emphasis is placed on their highly polymorphic nature (Gelfand et al, 2014). The molecular mechanisms that generate variability in the number of repeating units at VNTR and STR loci in a population are distinct. In STR, repeat number variability is mostly generated by strand slippage of the DNA polymerase during replication (Fan and Chu, 2007; Mirkin, 2007). Variability of VNTR is detected by restriction fragment length polymorphism (RFLP), a restriction digestion followed by Southern hybridization with a minisatellite probe (Nakamura et al, 1987; Sreenan et al, 1997).
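
The definitions above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the catalog: the function names are hypothetical, the 6 bp threshold is the STR/VNTR boundary stated in the abstract, and only exact back-to-back copies are counted, whereas the catalog relies on alignment-based comparisons and also targets fuzzy TR.

```python
# Illustrative sketch only; all names here are hypothetical.
STR_MAX_UNIT_BP = 6  # STR/VNTR boundary as stated in the abstract


def classify_unit(unit_length_bp: int) -> str:
    """Return 'STR' for units of at most 6 bp, 'VNTR' for longer units."""
    if unit_length_bp < 1:
        raise ValueError("repeat unit length must be at least 1 bp")
    return "STR" if unit_length_bp <= STR_MAX_UNIT_BP else "VNTR"


def count_tandem_copies(sequence: str, unit: str) -> int:
    """Count exact, directly adjacent copies of `unit` at the start of `sequence`."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    copies = 0
    pos = 0
    while sequence.startswith(unit, pos):
        copies += 1
        pos += len(unit)
    return copies


def copy_number_change(reference: str, sample: str, unit: str) -> int:
    """Sample copies minus reference copies: positive means expansion,
    negative means contraction, zero means no polymorphism at this locus."""
    return count_tandem_copies(sample, unit) - count_tandem_copies(reference, unit)
```

For example, a locus that reads `CAGCAGCAG` in the reference and `CAGCAGCAGCAGCAG` in another assembly has a 3 bp unit, so it falls in the STR class, and it shows an expansion of two repeating units, which would make it polymorphic under the definition given in the abstract.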
