Abstract

Centromeric alpha satellite (AS) is composed of highly identical higher-order repetitive DNA sequences, which make the standard assembly process impossible. Because of this, AS repeats were severely underrepresented in previous versions of the human genome assembly, which showed large centromeric gaps. The latest hg38 assembly (GCA_000001405.15) employed a novel method of approximate representation of these sequences, using AS reference models to fill the gaps. As a result, much more assembled AS became available for genomic analysis. We used our previously described PERCON program to annotate the various suprachromosomal families (SFs) of AS in the hg38 assembly and present the results of our primary analysis as an easy-to-read track for the UCSC Genome Browser. The monomeric classes characteristic of the five known SFs are color-coded, which allows quick visual assessment of AS composition from whole multi-megabase centromeres down to each individual AS monomer. This is the first such comprehensive annotation of AS in the human genome assembly. It showed the expected prevalence of the known major types of AS organization characteristic of the five established SFs. Some less common types of AS arrays were also identified, such as pure R2 domains in SF5, apparent J/R and D/R mixes in SF1 and SF2, and several different SF4 higher-order repeats among reference models and in regular contigs. No new SFs or large unclassed AS domains were discovered. The dataset reveals the architecture of human centromeres and allows classification of AS sequence reads by alignment to the annotated hg38 assembly. The data were deposited here: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgt.customText=https://dl.dropboxusercontent.com/u/22994534/AS-tracks/human-GRC-hg38-M1SFs.bed.bz2.
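As a rough illustration of how reads might be classified by alignment to an annotated track, the following Python sketch assigns an alignment to the annotation label that covers most of its bases. It is only a sketch under stated assumptions: the functions `parse_bed` and `classify_alignment` are hypothetical helpers, the track is assumed to be tab- or space-delimited BED with the monomer class label in the fourth (name) column, and the intervals and the chromosome name `chrDemo` are invented toy values, not taken from the deposited file.

```python
from collections import defaultdict

def parse_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end, name)}."""
    track = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and UCSC header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # ASSUMPTION: column 4 carries the monomer class label.
        name = fields[3] if len(fields) > 3 else "."
        track[chrom].append((start, end, name))
    for intervals in track.values():
        intervals.sort()
    return dict(track)

def classify_alignment(track, chrom, start, end):
    """Return the label covering the most bases of [start, end), or None."""
    covered = defaultdict(int)
    for a_start, a_end, name in track.get(chrom, []):
        if a_start >= end:
            break  # intervals are sorted by start, so nothing later overlaps
        overlap = min(end, a_end) - max(start, a_start)
        if overlap > 0:
            covered[name] += overlap
    if not covered:
        return None
    return max(covered, key=covered.get)

# Toy track: three adjacent 171-bp intervals with invented labels.
demo = [
    'track name="AS demo" itemRgb="On"',
    "chrDemo\t0\t171\tJ1",
    "chrDemo\t171\t342\tJ2",
    "chrDemo\t342\t513\tD1",
]
track = parse_bed(demo)
print(classify_alignment(track, "chrDemo", 100, 300))  # J2
```

The linear scan keeps the sketch short; for a genome-scale track an interval index would be the more practical choice.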

Highlights

  • Repository Citation: Shepelev VA, Uralsky LI, Alexandrov AA, Yurov YB, Rogaev EI, Alexandrov IA. (2015)

  • The dataset reveals the architecture of human centromeres and allows classification of alpha satellite (AS) sequence reads by alignment to the annotated hg38 assembly

  • SF1 and SF2 sequences are uniformly arranged in arrays with J1J2 and D1D2 dimeric periodicities, respectively; remnants of the W1W2W3W4W5 pentameric order can be discerned in SF3 sequences; and SF5 clusters show irregular alternation of R1 and R2 monomers
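The periodicities named in the last highlight can be made concrete with a small Python sketch that looks for the shortest period in a list of monomer class labels. This is an illustrative check, not the PERCON method: the function `best_period`, its thresholds, and the toy label lists are assumptions introduced here.

```python
def best_period(labels, max_period=12, min_match=0.9):
    """Return the smallest shift p at which the label list matches itself
    in at least `min_match` of positions, or None if no such shift exists."""
    n = len(labels)
    # Only test shifts up to half the array, so each shift has enough pairs.
    limit = min(max_period, n // 2)
    for p in range(1, limit + 1):
        pairs = n - p
        matches = sum(labels[i] == labels[i + p] for i in range(pairs))
        if matches / pairs >= min_match:
            return p
    return None

# Toy arrays using the class names from the text; the orders are invented.
sf1_like = ["J1", "J2"] * 10
sf3_like = ["W1", "W2", "W3", "W4", "W5"] * 6
sf5_like = ["R1", "R2", "R2", "R1", "R1", "R1",
            "R2", "R1", "R2", "R2", "R1", "R2"]

print(best_period(sf1_like))  # 2
print(best_period(sf3_like))  # 5
print(best_period(sf5_like))  # None
```

Lowering `min_match` would let the check tolerate the occasional disrupted repeat, which matters when only remnants of an order survive.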

Summary

A general layout of AS sequences in hg38 assembly

Centromeric regions of human chromosomes in the hg38 assembly [1] (GCA_000001405.15) can be divided into two main parts. The SF4 group contains all the older layers of non-HOR AS. It has been subdivided into a number of SFs, most of which have not yet received formal names pending finalization of a new classification system. Reference models are not real DNA sequences like traditional GenBank contigs; instead, they are collections of all WGS reads that match a certain HOR, put into a contig by a stochastic approach that uses a generative Markov process, which is not expected to recreate the true long-range linear order across the entire array [1,9]. They can be very helpful in mapping AS deep sequencing or WGS reads to the human genome assembly. Identical sets of 3 AS reference models (of which only one is live) appear on chromosomes 5 and 19 (paired domain 5/19), and the live model from this set also appears on chromosome 1, where the HOR is very similar to that of the 5/19 paired domain and apparently cannot be distinguished by the reference model assembly process (see Tables 2 and S1).
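To show why a generative Markov process preserves local structure but not the true long-range order, here is a minimal Python sketch of a fixed-order Markov chain built from reads. It is not the procedure used to build the hg38 reference models, whose details are given in [1,9]; the functions `build_transitions` and `generate`, the chain order, and the toy reads (single letters standing in for sequence symbols) are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_transitions(reads, order=2):
    """Record which symbol follows each `order`-long context in the reads.
    ASSUMPTION: order >= 1."""
    transitions = defaultdict(list)
    for read in reads:
        for i in range(len(read) - order):
            context = tuple(read[i:i + order])
            transitions[context].append(read[i + order])
    return transitions

def generate(transitions, seed_context, length, rng):
    """Random walk through the chain; stops early at a dead-end context."""
    order = len(seed_context)
    out = list(seed_context)
    while len(out) < length:
        choices = transitions.get(tuple(out[-order:]))
        if not choices:
            break
        # Sampling from the raw list weights each successor by its count.
        out.append(rng.choice(choices))
    return out

# Toy reads: mostly "abc" repeats with an occasional "abd" variant.
reads = [list("abcabcabd"), list("bcabdabc")]
transitions = build_transitions(reads, order=2)
rng = random.Random(0)  # fixed seed so the walk is repeatable
model = generate(transitions, ("a", "b"), 30, rng)
print("".join(model))
```

Every three-symbol window in the generated sequence occurs somewhere in the reads, but where the "abd" variants fall along the array is decided by chance, not by their real positions.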

AS classification used by PERCON in the context of the human genome
Live 5
PERCON program
UCSC Browser Track
Overall statistics of AS in hg38 assembly
Annotation of AS HOR reference models
Discussion