Abstract

Background: MapReduce is a parallel framework that has been used effectively to design large-scale parallel applications for large computing clusters. In this paper, we evaluate the viability of the MapReduce framework for designing phylogenetic applications. The problem of interest is generating the all-to-all Robinson-Foulds distance matrix, which has many applications for visualizing and clustering large collections of evolutionary trees. We introduce MrsRF (MapReduce Speeds up RF), a multi-core algorithm to generate a t × t Robinson-Foulds distance matrix between t trees using the MapReduce paradigm.

Results: We studied the performance of our MrsRF algorithm on two large biological tree sets consisting of 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each. Our experiments show that MrsRF is a scalable approach reaching a speedup of over 18 on 32 total cores. Our results also show that achieving top speedup on a multi-core cluster requires different cluster configurations. Finally, we show how to use an RF matrix to summarize collections of phylogenetic trees visually.

Conclusion: Our results show that MapReduce is a promising paradigm for developing multi-core phylogenetic applications. The results also demonstrate that different multi-core configurations must be tested in order to obtain optimum performance. We conclude that RF matrices play a critical role in developing techniques to summarize large collections of trees.

Highlights

  • MapReduce is a parallel framework that has been used effectively to design large-scale parallel applications for large computing clusters.

  • We introduce MrsRF (MapReduce Speeds up RF), a multi-core algorithm that uses the MapReduce framework to compute the all-to-all t × t RF distance matrix.

  • Our results show that MrsRF is a very scalable approach for computing the all-to-all RF matrix, with performance improving as the problem size grows.
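
To make the all-to-all RF matrix concrete, the following is a minimal sketch in Python, not the MrsRF implementation itself. It assumes each tree is already given as a set of non-trivial bipartitions (each written as the frozenset of taxa on the side that excludes a fixed reference taxon), and it takes the RF distance to be the number of bipartitions found in exactly one of the two trees; some definitions halve this value. The function names (`map_phase`, `reduce_phase`, `rf_matrix`) and the three five-taxon trees are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(trees):
    """Emit one (bipartition, tree_index) pair per bipartition of each tree."""
    for idx, bipartitions in enumerate(trees):
        for bp in bipartitions:
            yield bp, idx


def reduce_phase(pairs):
    """Group tree indices by the bipartition they contain."""
    groups = defaultdict(list)
    for bp, idx in pairs:
        groups[bp].append(idx)
    return groups


def rf_matrix(trees):
    """Return the t x t matrix of RF distances (unhalved convention)."""
    t = len(trees)
    shared = [[0] * t for _ in range(t)]
    # Every pair of trees listed under the same bipartition shares it.
    for indices in reduce_phase(map_phase(trees)).values():
        for i, j in combinations(indices, 2):
            shared[i][j] += 1
            shared[j][i] += 1
    sizes = [len(bipartitions) for bipartitions in trees]
    # Bipartitions in exactly one tree: |B_i| + |B_j| - 2 * |shared|.
    return [
        [0 if i == j else sizes[i] + sizes[j] - 2 * shared[i][j] for j in range(t)]
        for i in range(t)
    ]


# Hypothetical unrooted trees on taxa A..E; reference taxon is E.
example_trees = [
    {frozenset("AB"), frozenset("ABC")},  # ((A,B),C,(D,E))
    {frozenset("AC"), frozenset("ABC")},  # ((A,C),B,(D,E))
    {frozenset("AD"), frozenset("ACD")},  # ((A,D),C,(B,E))
]

if __name__ == "__main__":
    for row in rf_matrix(example_trees):
        print(row)
```

Grouping by bipartition first means the pairwise work is driven by shared bipartitions rather than by comparing every pair of trees directly. In a real MapReduce setting the two phases would run as distributed tasks, whereas here they are ordinary in-process functions.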


Summary

Introduction

MapReduce [1] is an exciting new paradigm for designing parallel applications. It was popularized by Google to support the parallel and distributed execution of data-intensive applications. To process petabytes of data, Google executes thousands of MapReduce applications per day. The bioinformatics community is interested in harnessing the power of MapReduce to develop parallel applications that process large genomic datasets. We study whether
