HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads.

Anas A Al-Okaily

doi:10.1186/s12864-016-2515-7

Abstract

Background: Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging. Although high-quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter are costly to generate. Recently, the GAGE-B study showed that remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with very high coverage.

Results: In this paper, we introduce a novel hierarchical genome assembly (HGA) methodology that takes further advantage of such very high coverage by independently assembling disjoint subsets of reads, combining the assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads.

Conclusions: We empirically evaluated this methodology for 8 leading assemblers using 7 GAGE-B bacterial datasets consisting of 100 bp Illumina HiSeq and 250 bp Illumina MiSeq reads, with coverage ranging from 100x to approximately 200x. The results show that for all evaluated datasets, and with most of the evaluated assemblers (those used to assemble the disjoint subsets), HGA leads to a significant improvement in assembly quality based on the N50 and corrected N50 metrics.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2515-7) contains supplementary material, which is available to authorized users.
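The first step of the methodology, splitting the reads into disjoint subsets, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `partition_reads` and `subset_coverage` are hypothetical, random assignment of reads to subsets is an assumption, and each identifier is assumed to stand for a whole read pair so that mates stay together.

```python
import random


def partition_reads(read_ids, num_subsets, seed=0):
    """Split read identifiers into disjoint random subsets of near-equal size.

    Assumption: each identifier stands for one read pair, so both mates
    always land in the same subset.
    """
    if num_subsets < 1:
        raise ValueError("num_subsets must be at least 1")
    shuffled = list(read_ids)
    # A seeded generator keeps the partition reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled reads round-robin; subset sizes differ by at most one.
    return [shuffled[i::num_subsets] for i in range(num_subsets)]


def subset_coverage(total_coverage, num_subsets):
    """Approximate coverage of each subset when reads are split evenly."""
    return total_coverage / num_subsets


subsets = partition_reads(range(10), 3)
per_subset = subset_coverage(200, 4)
```

Under these assumptions, a 200x library split into four subsets leaves roughly 50x per subset. Each subset would then be assembled independently before the resulting contigs are combined and re-assembled with the original reads.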

Highlights

Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging
Interest in the problem has been renewed in the past decade due to the advent of next-generation sequencing (NGS) technologies, which generate large numbers of short (100–400 bp) reads with relatively low sequencing error rates
The greedy assembly algorithm works by selecting seed reads and greedily extending them with the reads that overlap them most, until no further overlap is possible

Summary

Introduction

Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging. The GAGE-B study showed that remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with very high coverage. In the greedy approach, the assembly algorithm works by selecting seed reads and greedily extending them with the reads that overlap them most, until no further overlap is possible. This approach was adopted by some early assemblers such as SSAKE [1], SHARCGS [2], and VCAKE [3]. The greedy approach does not take into account ambiguities induced by repeats and sequencing errors, resulting in a large number of mis-assembly errors.
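The greedy seed-and-extend idea can be illustrated with a short sketch. It is a deliberately simplified toy, not the algorithm of SSAKE, SHARCGS, or VCAKE: the names `overlap_length` and `greedy_extend` are hypothetical, extension is to the right only, overlaps must match exactly, and reverse complements are ignored.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that is a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_extend(seed, reads, min_overlap=3):
    """Extend `seed` to the right with the best-overlapping read each round."""
    contig = seed
    remaining = list(reads)
    if seed in remaining:
        remaining.remove(seed)  # the seed itself is already in the contig
    while remaining:
        best_read, best_len = None, 0
        for read in remaining:
            size = overlap_length(contig, read, min_overlap)
            if size > best_len:
                best_read, best_len = read, size
        if best_read is None:
            break  # no read overlaps the contig end any more
        # Append only the part of the read that extends past the overlap.
        contig += best_read[best_len:]
        remaining.remove(best_read)
    return contig


contig = greedy_extend("ATGCG", ["ATGCG", "GCGTA", "GTACC"])
```

Because each round commits to the single best overlap, the sketch has no way to detect that a repeat offers several equally plausible extensions, which is the weakness described above.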

Methods

Results

Discussion

Conclusion