Abstract

Background

While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment “gaps”: uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes.

Results

Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8 % and 13.8 % of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively, a feat that is not possible with other leading tools with the breadth of data used in our study.

Conclusion

Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0663-4) contains supplementary material, which is available to authorized users.
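
The idea summarized in the abstract, bridging the two flanks of a gap by walking a de Bruijn graph whose k-mers are stored in a Bloom filter, can be illustrated with a minimal Python sketch. This is not Sealer's implementation: the names `BloomFilter`, `load_reads` and `close_gap`, the hashing scheme, the breadth-first search and the toy reads are all assumptions made for illustration, and reverse complements and sequencing errors are ignored.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Bit array queried through several hashes; may give false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per hash function by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def load_reads(reads, k, bloom):
    """Insert every k-mer of every read into the Bloom filter."""
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])


def close_gap(left_flank, right_flank, k, bloom, max_gap):
    """Search for a path from the last k-mer of the left flank to the first
    k-mer of the right flank. Returns the gap sequence if exactly one fill of
    at most max_gap bases is found, otherwise None. Assumes k >= 2."""
    start, goal = left_flank[-k:], right_flank[:k]
    fills = set()
    queue = deque([start])
    while queue:
        path = queue.popleft()
        if len(path) >= 2 * k and path.endswith(goal):
            fills.add(path[k:-k])  # bases between the two flanking k-mers
            continue
        if len(path) >= 2 * k + max_gap:
            continue  # gap would exceed the allowed length
        for base in "ACGT":
            # The graph is implicit: an edge exists if the successor k-mer
            # is reported present by the Bloom filter.
            if path[-(k - 1):] + base in bloom:
                queue.append(path + base)
    return fills.pop() if len(fills) == 1 else None


# Toy data (hypothetical): three overlapping reads spanning the gap.
reads = ["ATGCGTACCTG", "GTACCTGAAGT", "CTGAAGTCCA"]
bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
load_reads(reads, 4, bloom)
fill = close_gap("ATGCGT", "AGTCCA", 4, bloom, max_gap=10)
print(fill)
```

Because the Bloom filter records only k-mer membership, the sketch discovers edges by querying the four possible successors of the current k-mer. A false positive can add a spurious branch, which is why this sketch accepts a fill only when it is unique within the length limit.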

Highlights

  • While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes

  • It is critical to develop tools that can scale up to these large datasets while using minimal computing resources. Projects such as the 1000 Genomes Project [3], The Cancer Genome Atlas [http://cancergenome.nih.gov/], and clinical uses of whole-genome sequencing [4] highlight the trend of processing giga-base-pair (Gbp) scale datasets. Even though these projects are about re-sequencing human genomes and transcriptomes, de novo assembly of the raw reads has been shown to provide valuable information on structural variations [5,6,7,8]

  • We demonstrate the scalability of Sealer on the white spruce (P. glauca) draft genome [13], which it processes in under 27 h using 40 GB of random-access memory (RAM), resources that can be found in contemporary commodity desktop computers
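
The memory footprint of a Bloom filter depends on how many distinct items it stores and on the false-positive rate that is tolerated. The standard sizing formulas can be sketched in Python as below. The function names and the example numbers are hypothetical and are not taken from the paper or from Sealer's settings.

```python
import math


def bloom_parameters(num_items, target_fpr):
    """Return (bits, hash_count) that minimise memory for a target
    false-positive rate, using the standard Bloom filter formulas."""
    bits = math.ceil(-num_items * math.log(target_fpr) / math.log(2) ** 2)
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes


def false_positive_rate(num_items, bits, hashes):
    """Approximate false-positive rate after inserting num_items items."""
    return (1 - math.exp(-hashes * num_items / bits)) ** hashes


# Hypothetical example: one million distinct k-mers, 1 % false positives.
bits, hashes = bloom_parameters(1_000_000, 0.01)
print(bits / 8 / 1e6, "MB with", hashes, "hash functions")
print(false_positive_rate(1_000_000, bits, hashes))
```

At a fixed false-positive rate the required memory grows linearly with the number of distinct k-mers, and it does not depend on k, because only hashed bit positions are stored and not the k-mers themselves.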

