Abstract

BackgroundThe de novo assembly of genomes and transcriptomes from short sequences is a challenging problem. Because of the high coverage needed to assemble short sequences as well as the overhead of modeling the assembly problem as a graph problem, the methods for short sequence assembly are often validated using data from BACs or small sized prokaryotic genomes.ResultsWe present a parallel method for transcriptome assembly from large short sequence data sets. Our solution uses a rigorous graph theoretic framework and tames the computational and space complexity using parallel computers. First, we construct a distributed bidirected graph that captures overlap information. Next, we compact all chains in this graph to determine long unique contigs using undirected parallel list ranking, a problem for which we present an algorithm. Finally, we process this compacted distributed graph to resolve unique regions that are separated by repeats, exploiting the naturally occurring coverage variations arising from differential expression.ConclusionWe demonstrate the validity of our method using a synthetic high coverage data set generated from the predicted coding regions of Zea mays. We assemble 925 million sequences consisting of 40 billion nucleotides in a few minutes on a 1024 processor Blue Gene/L. Our method is the first fully distributed method for assembling a non-hierarchical short sequence data set and can scale to large problem sizes.

Highlights

  • The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem

  • The promise of inexpensive short reads has opened the door to the possibilities of resequencing individuals and sequencing more organisms at lower cost

  • We present a method for assembling the transcriptome of an organism from short reads derived from unnormalized expression libraries

Read more

Summary

Introduction

The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem. Because of the high coverage needed to assemble short sequences as well as the overhead of modeling the assembly problem as a graph problem, the methods for short sequence assembly are often validated using data from BACs or small sized prokaryotic genomes. Introduction The development of high-throughput short sequencing technologies, such as the Illumina Solexa and Applied Biosystems Solid systems, has sparked renewed interest in sequence assembly. An important problem in short sequence assembly is de novo genome reconstruction. For genomes with high repeat content, this task is already difficult with the much longer Sanger reads [1]. BMC Bioinformatics 2009, 10(Suppl 1):S14 http://www.biomedcentral.com/1471-2105/10/S1/S14 graph models rather than to the overlap-based greedy heuristics often utilized for Sanger reads. Graph models of particular interest include De Bruijn graphs and string graphs in either directed or bidirected forms

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call