Genomic characterization of large heterochromatic gaps in the human genome assembly.

Nicolas Altemose, Huntington F Willard, Karen H Miga, Mauro Maggioni

doi:10.1371/journal.pcbi.1003628

Abstract

The largest gaps in the human genome assembly correspond to multi-megabase heterochromatic regions composed primarily of two related families of tandem repeats, Human Satellites 2 and 3 (HSat2,3). The abundance of repetitive DNA in these regions challenges standard mapping and assembly algorithms, and as a result, the sequence composition and potential biological functions of these regions remain largely unexplored. Furthermore, existing genomic tools designed to predict consensus-based descriptions of repeat families cannot be readily applied to complex satellite repeats such as HSat2,3, which lack a consistent repeat unit reference sequence. Here we present an alignment-free method to characterize complex satellites using whole-genome shotgun read datasets. Utilizing this approach, we classify HSat2,3 sequences into fourteen subfamilies and predict their chromosomal distributions, resulting in a comprehensive satellite reference database to further enable genomic studies of heterochromatic regions. We also identify 1.3 Mb of non-repetitive sequence interspersed with HSat2,3 across 17 unmapped assembly scaffolds, including eight annotated gene predictions. Finally, we apply our satellite reference database to high-throughput sequence data from 396 males to estimate array size variation of the predominant HSat3 array on the Y chromosome, confirming that satellite array sizes can vary between individuals over an order of magnitude (7 to 98 Mb) and further demonstrating that array sizes are distributed differently within distinct Y haplogroups. In summary, we present a novel framework for generating initial reference databases for unassembled genomic regions enriched with complex satellite DNA, and we further demonstrate the utility of these reference databases for studying patterns of sequence variation within human populations.

Highlights

Long arrays of near-identical tandem repeats, termed satellite DNAs, compose the predominant sequence feature within constitutive heterochromatin in complex genomes [1].
In addition to whole chromosome shotgun (WCS) assignments, we studied those HSat2,3 sequences currently present in the human reference assembly (GRCh37/hg19) [62], including those found directly adjacent to heterochromatin gaps (983 kb total), as well as a much smaller number interspersed within chromosome arm assemblies (31 kb total).
We present a computational framework for studying both the satellite and non-satellite components of heterochromatic genomic regions, optimized for satellite families that are composed of a complex arrangement of simple repeats, a common feature of satellite families across diverse taxa.

Summary

Introduction

Long arrays of near-identical tandem repeats, termed satellite DNAs, compose the predominant sequence feature within constitutive heterochromatin in complex genomes [1]. Attempts to detect and characterize novel tandem repeats often depend on long read lengths and on the absence of interspersed stretches of exact sequence identity within each tandem repeat [8]. This presents a considerable problem for characterizing satellite DNA families that have an irregular repeat unit length and are composed of complex arrangements of short repeats. As a result, this type of complex satellite DNA remains largely uncharacterized, even within the well-characterized human genome. In contrast to the methods used to characterize satellite families with well-defined tandem repeat lengths (e.g. [5,6,9]), few sequence-based tools exist for exploring short, irregular repeat sequences across diverse genomic datasets.
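As a rough illustration of what an alignment-free summary of short reads can look like, the sketch below counts strand-independent k-mers directly from read sequences, without mapping them to a reference. It is not the method developed in this paper. The function names, the choice of k = 5, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative per k-mer/reverse-complement pair,
    so that counts do not depend on which strand a read came from."""
    return min(kmer, reverse_complement(kmer))


def kmer_profile(reads, k=5):
    """Count canonical k-mers across an iterable of read sequences.

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:
                counts[canonical(kmer)] += 1
    return counts


if __name__ == "__main__":
    # Toy reads, invented for this example: the second is the
    # reverse complement of the first.
    forward = ["CATTCCATTC"]
    reverse = ["GAATGGAATG"]
    print(kmer_profile(forward).most_common(3))
    print(kmer_profile(forward) == kmer_profile(reverse))
```

Because each k-mer is collapsed with its reverse complement, a read and its reverse complement yield the same profile. Such profiles can be compared between read sets without requiring a consistent repeat unit reference sequence.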

Methods

Results

Conclusion