Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes

David Williams, William L Trimble, Folker Meyer, Howard Ochman, Meghan Shilts

doi:10.1186/1471-2164-14-537

Abstract

Background: The numerous classes of repeats often impede the assembly of genome sequences from the short reads provided by new sequencing technologies. We demonstrate a simple and rapid means to ascertain the repeat structure and total size of a bacterial or archaeal genome without the need for assembly by directly analyzing the abundances of distinct k-mers among reads.

Results: The sensitivity of this procedure to resolve variation within a bacterial species is demonstrated: genome sizes and repeat structure of five environmental strains of E. coli from short Illumina reads were estimated by this method, and total genome sizes corresponded well with those obtained for the same strains by pulsed-field gel electrophoresis. In addition, this approach was applied to read-sets for completed genomes and shown to be accurate over a wide range of microbial genome sizes.

Conclusions: Application of these procedures, based solely on k-mer abundances in short read data sets, allows aspects of genome structure to be resolved that are not apparent from conventional short read assemblies. This knowledge of the repetitive content of genomes provides insights into genome evolution and diversity.
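The idea of estimating genome size directly from k-mer abundances can be illustrated with a minimal Python sketch. It uses a common simplification (total k-mer occurrences divided by the modal k-mer abundance), which is not necessarily the authors' exact procedure. The function names, the `min_abundance` parameter, and the error-free toy reads are all assumptions made for illustration.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence (forward strand only) across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_abundance=1):
    """Return (estimated size, modal depth) from a k-mer count table."""
    # Abundance spectrum: abundance -> number of distinct k-mers seen that often.
    spectrum = Counter(counts.values())
    # Ignore abundances below min_abundance (hypothetical knob for real data,
    # where sequencing errors create many low-abundance k-mers).
    usable = {a: n for a, n in spectrum.items() if a >= min_abundance}
    # The modal abundance approximates the depth of single-copy k-mers.
    depth = max(usable, key=usable.get)
    total = sum(a * n for a, n in usable.items())
    return total / depth, depth


# Toy data (assumption): a random 2,000-bp circular "genome" that contains one
# duplicated 200-bp segment, sampled by error-free reads starting at every base.
rng = random.Random(0)
unique = "".join(rng.choice("ACGT") for _ in range(1800))
genome = unique[:1000] + unique[100:300] + unique[1000:]
read_len, k = 50, 15
circular = genome + genome[:read_len - 1]
reads = [circular[i:i + read_len] for i in range(len(genome))]

size, depth = estimate_genome_size(count_kmers(reads, k))
print(f"modal depth {depth}, estimated size {size:.0f} bp, true size {len(genome)} bp")
```

Because k-mers inside the duplicated segment are counted at twice the single-copy depth, the repeat contributes its full length to the total, so the estimate does not collapse repeats the way an assembly can. Real read sets would additionally need both strands and sequencing errors handled, which this sketch omits.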

Highlights

The numerous classes of repeats often impede the assembly of genome sequences from the short reads provided by new sequencing technologies.
Abundance analysis using smaller k-mers is sensitive to shorter repeats, resulting in larger proportions of sequence inferred to occur in multiple copies (Additional file 1: Table S1). In addition, k-mers shorter than 15 nucleotides are more likely to match by chance alone.
The wider range of repeat lengths spanned by longer k-mers causes abundance analysis to be less sensitive to short repeats, such that smaller fractions of the genome will be classified as part of a repeat.
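The dependence on k-mer length described in these highlights can be sketched in Python as follows. The toy genome, the repeat lengths (30 bp and 200 bp), the 5-Mb genome size, and both function names are assumptions chosen for illustration, and the chance-match estimate assumes uniform, independent bases on a single strand.

```python
import random
from collections import Counter


def repeated_fraction(genome, k):
    """Fraction of k-mer positions whose k-mer occurs more than once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for km in kmers if counts[km] > 1) / len(kmers)


def expected_chance_copies(genome_size, k):
    """Expected chance occurrences of one given k-mer in random sequence."""
    return genome_size / 4 ** k


rng = random.Random(1)


def rand_seq(n):
    """Random DNA string of length n (uniform base composition)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


# Toy genome (assumption): unique sequence plus two copies each of a short
# (30-bp) and a long (200-bp) repeat.
short_rep, long_rep = rand_seq(30), rand_seq(200)
genome = (rand_seq(1000) + short_rep + rand_seq(500) + long_rep +
          rand_seq(500) + short_rep + rand_seq(500) + long_rep + rand_seq(500))

fractions = {k: repeated_fraction(genome, k) for k in (12, 20, 40)}
for k, frac in fractions.items():
    print(f"k={k}: {frac:.3f} of positions fall in repeated k-mers")

# Chance matches for a genome of roughly 5 Mb (assumed size).
for k in (11, 15, 20):
    print(f"k={k}: {expected_chance_copies(5_000_000, k):.2e} expected chance copies")
```

In this sketch the 30-bp repeat is detected at k = 12 and k = 20 but is invisible at k = 40, so the repeated fraction shrinks as k grows. The chance-copy estimate falls from above one at k = 11 to well below one at k = 15, in line with the highlight about k-mers shorter than 15 nucleotides.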

Summary

Introduction

The numerous classes of repeats often impede the assembly of genome sequences from the short reads provided by new sequencing technologies. Single-copy sequences and repeat regions shorter than the read length can be accurately assembled, yielding draft genomes that consist of unordered contigs separated by gaps of unspecified size [1]. These technological limitations conceal the size and organization of a genome because the number, contents and distribution of repeat arrays remain unknown. Even bacterial genomes, which consist mostly of single-copy protein-encoding genes, can contain multiple classes of repetitive sequences that can prevent their assembly and analysis. Among them are the 21-to-65-nt repetitive extragenic palindromic (REP) sequences [9], the 127-bp palindromic enterobacterial repetitive intergenic consensus (ERIC) sequences [10], and the bacterial interspersed mosaic elements (BIME) [11], each of which can be dispersed throughout the genome or exist in tandem arrays [12].

Methods

Results

Conclusion