Parallel short sequence assembly of transcriptomes

Benjamin G Jackson, Srinivas Aluru, Patrick S Schnable

doi:10.1186/1471-2105-10-s1-s14

Abstract

Background: The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem. Because of the high coverage needed to assemble short sequences as well as the overhead of modeling the assembly problem as a graph problem, the methods for short sequence assembly are often validated using data from BACs or small prokaryotic genomes.

Results: We present a parallel method for transcriptome assembly from large short sequence data sets. Our solution uses a rigorous graph theoretic framework and tames the computational and space complexity using parallel computers. First, we construct a distributed bidirected graph that captures overlap information. Next, we compact all chains in this graph to determine long unique contigs using undirected parallel list ranking, a problem for which we present an algorithm. Finally, we process this compacted distributed graph to resolve unique regions that are separated by repeats, exploiting the naturally occurring coverage variations arising from differential expression.

Conclusion: We demonstrate the validity of our method using a synthetic high coverage data set generated from the predicted coding regions of Zea mays. We assemble 925 million sequences consisting of 40 billion nucleotides in a few minutes on a 1024 processor Blue Gene/L. Our method is the first fully distributed method for assembling a non-hierarchical short sequence data set and can scale to large problem sizes.
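The chain-compaction step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' undirected, distributed algorithm: it assumes a single chain whose direction is already known, and it simulates the classic pointer-jumping approach to list ranking sequentially. The names `list_rank` and `compact_chain`, and the toy 3-mers, are hypothetical and chosen only for illustration.

```python
def list_rank(succ):
    """Rank nodes of linked lists by pointer jumping.

    succ[i] is the successor of node i, or None at the tail.
    Returns rank[i] = number of links from node i to its tail.
    Each round is a bulk update that mirrors one synchronous
    parallel step; here the rounds run sequentially.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if s is None else 1 for s in succ]
    # Every round doubles the distance each pointer spans,
    # so roughly log2(n) rounds are needed.
    while any(p is not None for p in nxt):
        new_rank = rank[:]
        new_nxt = nxt[:]
        for i in range(n):
            p = nxt[i]
            if p is not None:
                new_rank[i] = rank[i] + rank[p]
                new_nxt[i] = nxt[p]
        rank, nxt = new_rank, new_nxt
    return rank


def compact_chain(kmers, succ):
    """Merge one chain of k-mers (overlapping by k-1) into a contig.

    Assumption: all nodes belong to a single directed chain.
    """
    rank = list_rank(succ)
    # The head of the chain has the largest rank.
    order = sorted(range(len(kmers)), key=lambda i: -rank[i])
    contig = kmers[order[0]]
    for i in order[1:]:
        contig += kmers[i][-1]  # each step adds one new base
    return contig


# Toy chain stored in arbitrary order: ACG -> CGT -> GTA -> TAC
kmers = ["GTA", "ACG", "TAC", "CGT"]
succ = [2, 3, None, 0]
contig = compact_chain(kmers, succ)
```

Ranking first and concatenating second means the nodes of a chain do not have to be stored in order, which is the property that makes list ranking attractive when a chain is scattered across processors.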

Highlights

The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem
The promise of inexpensive short reads has opened the door to the possibilities of resequencing individuals and sequencing more organisms at lower cost
We present a method for assembling the transcriptome of an organism from short reads derived from unnormalized expression libraries

Summary

Introduction

The development of high-throughput short sequencing technologies, such as the Illumina Solexa and Applied Biosystems SOLiD systems, has sparked renewed interest in sequence assembly. An important problem in short sequence assembly is de novo genome reconstruction. For genomes with high repeat content, this task is already difficult with the much longer Sanger reads [1].

BMC Bioinformatics 2009, 10(Suppl 1):S14 http://www.biomedcentral.com/1471-2105/10/S1/S14

[...] graph models rather than to the overlap-based greedy heuristics often utilized for Sanger reads. Graph models of particular interest include De Bruijn graphs and string graphs in either directed or bidirected forms.
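As a rough illustration of one of the graph models named above, the following sketch builds a directed De Bruijn graph from a handful of reads. It is a minimal single-machine example under stated assumptions: the function name `de_bruijn`, the choice of `k`, and the toy reads are hypothetical, and reverse-complement strands (which the bidirected form is designed to handle) are ignored.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a directed De Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer in a read contributes one edge
    from its prefix to its suffix. Edge values count how often the
    k-mer was seen, a simple proxy for coverage.
    Assumption: only the forward strand is considered.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


# Two overlapping toy reads share the 3-mer CGT.
graph = de_bruijn(["ACGT", "CGTA"], k=3)
```

Keeping the per-edge counts is a deliberate choice here: coverage differences between edges are the kind of signal the abstract describes exploiting when regions are separated by repeats.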

Journal: BMC Bioinformatics
Publication Date: Jan 1, 2009
Citations: 55
License type: CC BY 2.0

Similar Papers

Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms
Berat Z Haznedaroglu ... Jordan Peccia
BMC Bioinformatics | VOL. 13
18 Jul 2012

Genetic Algorithm Based Probabilistic Motif Discovery in Unaligned Biological Sequences
M Hemalatha ... K Vivekanand
Journal of Computer Science | VOL. 4
01 Aug 2008

Efficient counting of k-mers in DNA sequences using a bloom filter
Páll Melsted ... Jonathan K Pritchard
BMC Bioinformatics | VOL. 12
10 Aug 2011

Block based video alignment with linear time and space complexity
Armin Kappeler ... Aggelos K Katsaggelos
01 Sep 2016
