Abstract

The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ∼10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times, and are typically used as sequence markers to classify and identify bacteria. Here, we exploit the genomic context in which rDNAs occur across taxa to improve assembly of these regions relative to de novo sequencing by using the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions within a genome. We describe a method to construct targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can assist in closure of a genome.

Highlights

  • Sequencing bacterial genomes has become much more cost effective and convenient, but the number of complete, closed bacterial genomes remains a small fraction of the total number sequenced (Figure 1)

  • Although draft genomes are often of very high quality and suited for many types of analysis, researchers must choose between working with these draft genomes and spending time and resources polishing the genome with some combination of in silico tools, polymerase chain reaction (PCR), optical mapping, re-sequencing, or hybrid sequencing [1,3]

  • rDNA and 1 kb flanking regions were extracted from Escherichia coli Sakai [28] (BA000007.2), a strain in which rDNAs have been well characterized [29]
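
The extraction of an rDNA together with its 1 kb flanking regions can be illustrated with a minimal Python sketch. This is not riboSeed's own code: the function name `extract_with_flanks`, the 0-based end-exclusive coordinate convention, and the toy sequence are assumptions made for illustration only. The sketch treats the chromosome as circular by default, so a region near the origin takes its flank from the other end of the sequence.

```python
def extract_with_flanks(genome: str, start: int, end: int,
                        flank: int = 1000, circular: bool = True) -> str:
    """Return genome[start:end] plus `flank` bases on each side.

    Hypothetical helper for illustration. Coordinates are 0-based and
    end-exclusive. On a circular chromosome the flanks wrap around the
    origin; on a linear one they are truncated at the sequence ends.
    """
    n = len(genome)
    if not 0 <= start < end <= n:
        raise ValueError("region must satisfy 0 <= start < end <= len(genome)")
    if flank < 0:
        raise ValueError("flank must be non-negative")

    left, right = start - flank, end + flank
    if not circular:
        # Linear sequence: clip the flanks to the available bases.
        return genome[max(left, 0):min(right, n)]

    span = right - left
    if span >= n:
        raise ValueError("region plus flanks is not shorter than the chromosome")

    # Circular sequence: shift the window start into [0, n) and wrap if needed.
    s = left % n
    e = s + span
    if e <= n:
        return genome[s:e]
    return genome[s:] + genome[:e - n]


if __name__ == "__main__":
    # Toy 10-letter "chromosome" so the wrap-around is easy to see.
    toy = "ABCDEFGHIJ"
    print(extract_with_flanks(toy, 4, 6, flank=2))   # interior region
    print(extract_with_flanks(toy, 0, 2, flank=2))   # wraps past the origin
```

In practice the region coordinates would come from the rDNA annotations of the reference genome; because the flanking sequence is unique within the genome, it is what ties each copy of the repeat to a single location.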

Summary

Introduction

Sequencing bacterial genomes has become much more cost effective and convenient, but the number of complete, closed bacterial genomes remains a small fraction of the total number sequenced (Figure 1). Even with the advent of new technologies for long-read sequencing and improvements to short read platforms, assemblies typically remain in draft status due to the computational bottleneck of genome closure [1,2]. The Illumina entries in NCBI’s Sequence Read Archive (SRA) [4] outnumber all other technologies combined by about an order of magnitude (Supplementary Table S1). Draft assemblies from these datasets have systematic problems common to short read datasets, including gaps in the scaffolds due to the difficulty of resolving assemblies of repeated regions [5,6]. By resolving repeated regions during the assembly process, it may be possible to improve existing assemblies, and obtain additional sequence information from existing short read datasets in the SRA or the European Nucleotide Archive

