Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

Kris Popendorf, Yasunori Osana, Yasubumi Sakakibara, Hachiya Tsuyoshi, Darren P Martin

doi:10.1371/journal.pone.0012651

Open Access

https://doi.org/10.1371/journal.pone.0012651

Journal: PLoS ONE
Publication Date: Sep 24, 2010
Citations: 49
License type: CC BY 4.0

Affiliation: Keio University, Seikei University

Abstract

Background: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.

Methodology/Principal Findings: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.

Conclusions/Significance: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.
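To make the idea of a spaced seed concrete, the following is a minimal sketch in Python, not Murasaki's implementation. It assumes a pattern written as a string of `1` (position must match) and `0` (position is ignored), and it uses an ordinary Python dictionary in place of the adaptive hash function described in the abstract. The function names `seed_key`, `index_seeds`, and `shared_seeds` are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict


def seed_key(window, pattern):
    """Keep only the bases at positions where the pattern has a '1'."""
    return "".join(base for base, bit in zip(window, pattern) if bit == "1")


def index_seeds(seq, pattern):
    """Map each spaced-seed key to the list of positions where it occurs."""
    table = defaultdict(list)
    width = len(pattern)
    for i in range(len(seq) - width + 1):
        table[seed_key(seq[i:i + width], pattern)].append(i)
    return table


def shared_seeds(seqs, pattern):
    """Return keys present in every sequence, with positions per sequence."""
    tables = [index_seeds(s, pattern) for s in seqs]
    common = set(tables[0]).intersection(*tables[1:])
    return {key: [t[key] for t in tables] for key in common}


# The two sequences differ at index 2, which the pattern "1101" ignores,
# so the windows starting at position 0 still produce the same key.
candidates = shared_seeds(["ACGTAC", "ACCTAC"], "1101")
```

Each key shared by all sequences marks a candidate anchor location. One pass over each sequence builds the index, which is the intuition behind the near linear behaviour reported above, although the actual cost depends on how many positions share a key.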

Highlights

With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem.
In Murasaki we introduce a novel hash function generation algorithm to automatically generate hash functions from arbitrary spaced seed patterns that approximate maximal hash key space utilization in a computationally inexpensive manner, which we term the "adaptive hash algorithm." The details of this algorithm are explained in the Methods section.
When an anchor is initially constructed based on a set of matching seeds, both ends are extended by an ungapped alignment until the minimum pairwise score falls below the X-dropoff parameter as in BLAST and BLASTZ [10].
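The ungapped extension step in the last highlight can be sketched as follows. This is one possible reading rather than Murasaki's code: it assumes a simple +1/-1 match/mismatch score, extends only to the right, tracks the lowest score over all sequence pairs, and stops once that minimum has dropped more than `x_drop` below the best minimum seen so far. The function name `extend_right` and all of its parameters are hypothetical, and at least two sequences are required.

```python
from itertools import combinations


def extend_right(seqs, starts, length, x_drop, match=1, mismatch=-1):
    """Return how many columns the seed can be extended to the right.

    seqs   -- list of at least two sequences (strings)
    starts -- start position of the seed in each sequence
    length -- seed length, in columns
    x_drop -- largest tolerated drop below the best minimum pairwise score
    """
    pairs = list(combinations(range(len(seqs)), 2))
    scores = {pair: 0 for pair in pairs}
    best_min = 0   # best value of the minimum pairwise score so far
    best_len = 0   # extension length at which best_min was reached
    offset = length
    while all(starts[i] + offset < len(seqs[i]) for i in range(len(seqs))):
        for a, b in pairs:
            same = seqs[a][starts[a] + offset] == seqs[b][starts[b] + offset]
            scores[(a, b)] += match if same else mismatch
        offset += 1
        current_min = min(scores.values())
        if current_min > best_min:
            best_min = current_min
            best_len = offset - length
        elif best_min - current_min > x_drop:
            break  # the weakest pair has fallen too far; stop extending
    return best_len


# Seed "ACGT" at position 0 in both sequences; three more columns match
# before the sequences diverge.
extra = extend_right(["ACGTACGTTTTT", "ACGTACGAAAAA"], [0, 0], 4, x_drop=2)
```

Extension to the left would mirror this loop with decreasing offsets. Returning the length at the best score, rather than the length at the stopping point, trims the low-scoring tail that triggered the stop.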

Summary

Introduction

With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.

