Finishing bacterial genome assemblies with Mix

Hayssam Soueidan, Alexis Groppi, Virginie Dupuy, Florence Maurier, Christine Citti, Pascal Sirand-Pugnet, Macha Nikolski, Florence Tardy

doi:10.1186/1471-2105-14-s15-s16

Abstract

Motivation: Among the challenges that hamper reaping the benefits of genome assembly are both unfinished assemblies and the ensuing experimental costs. First, numerous software solutions for de novo genome assembly are available, each having its advantages and drawbacks, without clear guidelines as to how to choose among them. Second, these solutions produce draft assemblies that often require a resource-intensive finishing phase.

Methods: In this paper we address these two aspects by developing Mix, a tool that mixes two or more draft assemblies without relying on a reference genome, with the goal of reducing contig fragmentation and thus speeding up genome finishing. The proposed algorithm builds an extension graph where vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a set of paths in the extension graph that maximizes the cumulative contig length.

Results: We evaluate the performance of Mix on bacterial NGS data from the GAGE-B study and apply it to newly sequenced Mycoplasma genomes. The resulting final assemblies demonstrate a significant improvement in overall assembly quality. In particular, Mix is consistent in providing better overall quality results even when the choice is guided solely by standard assembly statistics, as is the case for de novo projects.

Availability: Mix is implemented in Python and is available at https://github.com/cbib/MIX; novel data for our Mycoplasma study is available at http://services.cbib.u-bordeaux2.fr/mix/.
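The extension-graph idea described in the Methods paragraph above can be illustrated with a small Python sketch. This is not the Mix implementation. It is a toy model under strong assumptions: contigs have a single orientation, each alignment joins the end of one contig to the start of another, the graph is acyclic, and only one best path is extracted rather than a set of paths. The function names (`build_extension_graph`, `longest_path`) and the example contigs are hypothetical.

```python
from collections import defaultdict
from functools import lru_cache


def build_extension_graph(contig_lengths, alignments):
    """Build a toy extension graph.

    Vertices are contig extremities: (name, "start") and (name, "end").
    `alignments` holds (left, right, overlap) triples, meaning the end of
    `left` aligns with the start of `right` over `overlap` bases.
    """
    graph = defaultdict(list)
    for name, length in contig_lengths.items():
        # Walking a contig from start to end contributes its full length.
        graph[(name, "start")].append(((name, "end"), length))
    for left, right, overlap in alignments:
        # Extending `left` with `right`: the overlap must be counted once,
        # so the alignment edge carries a negative weight.
        graph[(left, "end")].append(((right, "start"), -overlap))
    return graph


def longest_path(graph, contig_lengths):
    """Return (cumulative_length, path) for the best path in an acyclic graph."""

    @lru_cache(maxsize=None)
    def best_from(vertex):
        # Baseline: stop at this vertex and gain nothing further.
        best = (0, (vertex,))
        for nxt, weight in graph.get(vertex, ()):
            length, path = best_from(nxt)
            candidate = (weight + length, (vertex,) + path)
            if candidate[0] > best[0]:
                best = candidate
        return best

    starts = (best_from((name, "start")) for name in contig_lengths)
    return max(starts, key=lambda result: result[0])


# Hypothetical contigs (lengths in bases) and end-to-start alignments.
lengths = {"A": 1000, "B": 800, "C": 500}
overlaps = [("A", "B", 200), ("B", "C", 100), ("A", "C", 50)]

graph = build_extension_graph(lengths, overlaps)
length, path = longest_path(graph, lengths)
order = [name for name, side in path if side == "start"]
print(length, order)
```

In this example the path through A, B and C gives 1000 - 200 + 800 - 100 + 500 = 2000 bases, which beats joining A directly to C (1450 bases). A real extension graph would also have to handle reverse-complement alignments, cycles, and the selection of several disjoint paths.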

Highlights

Moving a genome from the draft assembly stage to a complete finished genome is a labor-intensive task requiring time and further experimental work.
Genome assembly is a lively field that has produced in recent years numerous algorithms and tools, such as MIRA [1], CLC, ABySS [2], etc.
In the current manuscript we describe Mix, a finishing algorithm that generates an assembly starting from different genome assemblies, with the main objective of reducing contig fragmentation and maximizing the cumulative contig length.

Introduction

Moving a genome from the draft assembly stage to a complete finished genome is a labor-intensive task requiring time and further experimental work. In silico finishing cannot resolve all of these issues. Genome assembly is a lively field that has produced in recent years numerous algorithms and tools, such as MIRA [1], CLC (http://www.clcbio.com/genomics), ABySS [2], etc. Assemblers differ in their algorithmic foundations and present different advantages and pitfalls. Add to that the fact that reassembling an already assembled genome based on a new sequencing technology (e.g., Illumina vs Sanger) can reveal sequences that are missing in the reference assembly [3], and we end up with a very large space of obtainable de novo draft assemblies.