Multiple sequence alignment and reconstructing phylogenetic trees with Hadoop

Quan Zou

doi:10.1109/bibm.2016.7822735

https://doi.org/10.1109/bibm.2016.7822735

Publication Date: Dec 1, 2016

Affiliation: Tianjin University

Abstract

Multiple sequence alignment (MSA) is a “Holy Grail” problem in computational biology, but bottlenecks arise when aligning massive sets of homologous sequences. Most state-of-the-art software tools either cannot handle large-scale datasets or run very slowly on them. The similarity of homologous DNA sequences is often ignored, and the lack of parallelization remains a challenge for MSA research. Building phylogenetic trees for ultra-large sequence datasets is also time-consuming, and MSA is a prerequisite step for phylogenetic reconstruction. Taking advantage of developments in parallel computation, we employed the Hadoop platform to solve these two computationally intensive problems. Trie trees and suffix trees were used to accelerate the alignment of multiple similar DNA sequences, reducing the expected time complexity from quadratic to linear. For phylogenetic tree reconstruction, clustering and multiple sequence alignment were executed in parallel, and the basic phylogenetic trees were built using the neighbour-joining method. Experiments on two large datasets, each larger than 1 GB, show that our software tool can outperform other common phylogenetic reconstruction tools. Furthermore, the data, source code, and web servers are all openly available at http://lab.malab.cn/soft/halign/ and http://lab.malab.cn/soft/HPtree/
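
The neighbour-joining step named in the abstract can be illustrated with a minimal single-machine sketch. This is the textbook neighbour-joining procedure, not the authors' Hadoop implementation. The function name `neighbor_joining`, the internal node labels (`node0`, `node1`, ...), and the edge-list output format are assumptions made for illustration. The sketch also assumes at least two taxa, a symmetric distance matrix, and that no taxon is already named like an internal node.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    names: list of taxon labels; dist: list of lists of pairwise distances.
    Returns a list of (node_a, node_b, branch_length) edges.
    """
    # Dict-of-dicts form, so nodes can be removed and added by label.
    D = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges = []
    next_id = 0  # counter for hypothetical internal node labels

    while len(D) > 2:
        n = len(D)
        # Row sums r_i = sum over k of D[i][k].
        r = {a: sum(D[a].values()) for a in D}

        # Pick the pair minimising Q(i, j) = (n - 2) * D[i][j] - r_i - r_j.
        i, j = min(
            combinations(D, 2),
            key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )

        # Branch lengths from i and j to the new internal node u.
        u = f"node{next_id}"
        next_id += 1
        len_i = 0.5 * D[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = D[i][j] - len_i
        edges.append((i, u, len_i))
        edges.append((j, u, len_j))

        # Distances from u to every remaining node k.
        new_row = {k: 0.5 * (D[i][k] + D[j][k] - D[i][j])
                   for k in D if k not in (i, j)}

        # Replace i and j by u in the working matrix.
        del D[i], D[j]
        for row in D.values():
            del row[i], row[j]
        for k, d in new_row.items():
            D[k][u] = d
        new_row[u] = 0.0
        D[u] = new_row

    # Two nodes remain: connect them with their current distance.
    a, b = D
    edges.append((a, b, D[a][b]))
    return edges
```

For n taxa the function returns 2n - 3 edges, as expected for an unrooted binary tree. Each iteration scans all pairs, so this sketch takes cubic time overall; that cost is why clustering the sequences first and building subtrees in parallel, as the abstract describes, matters for datasets of this size.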
