Abstract

BackgroundNext-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA (a paramount issue to overcome). While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in previous literature.ResultsWe show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read-length) more effectively disambiguate repeat regions than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper-bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off -the-shelf assembler in the presence of base-call errors.ConclusionsOur results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.

Highlights

  • Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before

  • In the context of high-depth prokaryotic sequencing experiments, very short mate-pairs are more useful than long mate-pairs for resolving repeats, and that choosing mate-pair sizes based on the repeat structure of the assembly graph is a powerful approach for creating more complete assemblies

  • For 360 of these genomes, we performed a direct comparison of the amount of repeat resolution that can be obtained when applying information provided by long mate-pairs vs. mate-pairs whose sizes are ‘tuned’ to the repeat structure of the genomes being assembled

Read more

Summary

Introduction

Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. As sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Next-generation sequencing platforms such as Illumina, ABI Solid, and 454 allow genomes to be sequenced more quickly and at a lower cost than ever before. Advances in sequencing technologies have not been mirrored by corresponding improvements in finishing - a time-, labor-, and cost-intensive process aimed at reconstructing a complete, gapless, genome sequence from a fragmented assembly. The majority of genomes remain in an incomplete ‘draft’ state, hampering studies that rely on long-range genome structure information (e.g., analysis of operon/regulon structure, or large-scale genomic variation)

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call