Phylogenetic comparative assembly

Peter Husemann, Jens Stoye

doi:10.1186/1748-7188-5-3

Abstract

Background: Recent high throughput sequencing technologies are capable of generating a huge amount of data for bacterial genome sequencing projects. Although current sequence assemblers successfully merge the overlapping reads, often several contigs remain which cannot be assembled any further. It is still costly and time consuming to close all the gaps in order to acquire the whole genomic sequence.

Results: Here we propose an algorithm that takes several related genomes and their phylogenetic relationships into account to create a graph that contains the likelihood for each pair of contigs to be adjacent. Subsequently, this graph can be used to compute a layout graph that shows the most promising contig adjacencies in order to aid biologists in finishing the complete genomic sequence. The layout graph shows unique contig orderings where possible, and the best alternatives where necessary.

Conclusions: Our new algorithm for contig ordering uses sequence similarity as well as phylogenetic information to estimate adjacencies of contigs. An evaluation of our implementation shows that it performs better than recent approaches while being much faster at the same time.

Highlights

Recent high throughput sequencing technologies are capable of generating a huge amount of data for bacterial genome sequencing projects
The algorithm we present here is based on a simple data structure, the contig adjacency graph
From sequencing projects conducted at Bielefeld University, we obtained the contig sequences for three genomes of the Corynebacteria genus: C. aurimucosum (NC_012590), C. urealyticum [14], and C. kroppenstedtii [15]
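
The contig adjacency graph named in the highlights can be pictured with a small sketch. The Python below is an illustration only, not the authors' implementation. The class name `ContigAdjacencyGraph`, the representation of contig ends as `(contig, side)` pairs, the numeric scores, and the greedy selection step are all assumptions made for this example. The scoring that the paper derives from related genomes and their phylogenetic relationships is not reproduced here.

```python
from collections import defaultdict


class ContigAdjacencyGraph:
    """Hypothetical sketch: an undirected weighted graph over contig ends.

    A node is a pair (contig_name, side) with side "L" or "R".
    An edge weight accumulates support for two ends being adjacent.
    """

    def __init__(self):
        self.weights = defaultdict(float)

    @staticmethod
    def _key(end_a, end_b):
        # Store each undirected edge under one canonical key.
        return tuple(sorted((end_a, end_b)))

    def add_support(self, end_a, end_b, score):
        # Assumed: each related genome contributes a score for this pair.
        self.weights[self._key(end_a, end_b)] += score

    def best_adjacencies(self):
        """Greedy stand-in for a layout step: pick edges by descending
        weight, using every contig end at most once."""
        used = set()
        chosen = []
        ranked = sorted(self.weights.items(), key=lambda kv: (-kv[1], kv[0]))
        for (end_a, end_b), weight in ranked:
            if end_a in used or end_b in used:
                continue
            if end_a[0] == end_b[0]:
                continue  # skip edges joining the two ends of one contig
            used.update((end_a, end_b))
            chosen.append((end_a, end_b, weight))
        return chosen


# Usage with made-up contigs and scores.
graph = ContigAdjacencyGraph()
graph.add_support(("c1", "R"), ("c2", "L"), 0.6)  # support from one reference
graph.add_support(("c1", "R"), ("c2", "L"), 0.3)  # support from another
graph.add_support(("c1", "R"), ("c3", "L"), 0.5)
graph.add_support(("c2", "R"), ("c3", "L"), 0.4)
for end_a, end_b, weight in graph.best_adjacencies():
    print(end_a, end_b, round(weight, 2))
```

The greedy selection returns a single ordering and can in principle close a cycle. The layout graph described in the abstract instead keeps the best alternatives where no unique ordering exists, so this sketch covers only the data structure, not that step.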

Summary

Introduction

Recent high throughput sequencing technologies are capable of generating a huge amount of data for bacterial genome sequencing projects. Although current sequence assemblers successfully merge the overlapping reads, often several contigs remain which cannot be assembled any further. It is still costly and time consuming to close all the gaps in order to acquire the whole genomic sequence. In the first genome projects, the process of obtaining the DNA sequence by multi-step clone-by-clone sequencing approaches was costly and tedious. The genome is fragmented randomly into small parts. Each of these fragments is sequenced, for example, with recent high throughput methods [3,4]. For the ends of two estimated adjacent contigs, specific primer sequences have to be designed that function as start points for two polymerase chain reactions (PCRs) for Sanger

Methods

Results

Conclusion