An efficient and extensible approach for compressing phylogenetic trees

Suzanne J Matthews, Tiffani L Williams

doi:10.1186/1471-2105-12-s10-s16

Abstract

Background: Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees. Our previous work showed that TreeZip is a promising approach for compressing phylogenetic trees. In this paper, we extend our TreeZip algorithm by handling trees with weighted branches. Furthermore, by using the compressed TreeZip file as input, we have designed an extensible decompressor that can extract subcollections of trees, compute majority and strict consensus trees, and merge tree collections using set operations such as union, intersection, and set difference.

Results: On unweighted phylogenetic trees, TreeZip is able to compress Newick files by more than 98%. On weighted phylogenetic trees, TreeZip is able to compress a Newick file by at least 73%. TreeZip can be combined with 7zip with little overhead, allowing space savings in excess of 99% (unweighted) and 92% (weighted). Unlike TreeZip, 7zip is not immune to branch rotations, and performs worse as the level of variability in the Newick string representation increases. Finally, since the TreeZip compressed text (TRZ) file contains all the semantic information in a collection of trees, we can easily filter and decompress a subset of trees of interest (such as the set of unique trees), or build the resulting consensus tree in a matter of seconds. We also show the ease with which set operations can be performed on TRZ files, at speeds quicker than those performed on Newick or 7zip compressed Newick files, and without loss of space savings.

Conclusions: TreeZip is an efficient approach for compressing large collections of phylogenetic trees. The semantic and compact nature of the TRZ file allows it to be operated upon directly and quickly, without a need to decompress the original Newick file. We believe that TreeZip will be vital for compressing and archiving trees in the biological community.
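The abstract's remarks about branch rotations and set operations can be illustrated with a small sketch. It is not the TreeZip implementation or the TRZ format, which the abstract does not describe in detail. It assumes unweighted Newick strings without internal node labels, treats trees as rooted for simplicity, and uses a hypothetical helper, `clades`, that reduces a tree to the set of leaf groups under its internal nodes. Two Newick strings that differ only by branch rotations reduce to the same set, so collections can be compared with ordinary set operations.

```python
def clades(newick):
    """Return the clades (sets of leaf names) of an unweighted Newick tree.

    Assumes a parenthesized tree with no branch lengths or internal labels.
    """
    found = set()
    stack = []   # one list of leaf names per currently open parenthesis
    name = ""
    for ch in newick:
        if ch in "(),;":
            if name:                      # a leaf name just ended
                stack[-1].append(name)
                name = ""
            if ch == "(":
                stack.append([])
            elif ch == ")":
                leaves = stack.pop()
                found.add(frozenset(leaves))
                if stack:                 # pass the leaves up to the parent
                    stack[-1].extend(leaves)
        elif not ch.isspace():
            name += ch
    return frozenset(found)


# Same topology written with rotated branches: different text, same clades.
tree_a = "((A,B),(C,D));"
tree_b = "((D,C),(B,A));"
same_topology = clades(tree_a) == clades(tree_b)

# Set operations on two small collections of trees.
collection_1 = {clades(t) for t in [tree_a, "((A,C),(B,D));"]}
collection_2 = {clades(t) for t in [tree_b, "((A,D),(B,C));"]}
shared = collection_1 & collection_2      # intersection
combined = collection_1 | collection_2    # union
only_first = collection_1 - collection_2  # set difference
```

A byte-level compressor sees `tree_a` and `tree_b` as different strings, whereas a representation based on the tree's structure treats them as one tree. This is the intuition behind the rotation sensitivity the abstract reports for 7zip.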

Highlights

Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees
In addition to extracting all of the trees contained in a compressed TRZ file, we show how the TreeZip format can be used to perform additional extraction operations and to construct majority and strict consensus trees
Our results show that the compressed TreeZip (TRZ) file is over 74% smaller than the original Newick file on weighted collections

Summary

Introduction

Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees. To reconstruct a phylogenetic tree, the most popular techniques (such as MrBayes [1] and TNT [2]) often return tens to hundreds of thousands of trees that represent plausible or closely related hypotheses (or candidate trees) for how the taxa evolved from a common ancestor. Biologists therefore need new techniques for managing and sharing these large tree collections effectively. As biologists obtain more data to produce evolutionary trees, phylogenetic techniques must reconstruct larger trees, resulting in ever-larger collections of candidate trees. There is a critical need to develop phylogenetic compression techniques that reduce the requirements of storing large tree collections so that they can be shared with colleagues around the world.

Journal: BMC Bioinformatics	Publication Date: Oct 18, 2011
Citations: 5	License type: CC BY 2.0

Similar Papers

Constructing liberal and conservative supertrees and exact solutions for reduced consensus problems
Jianrong Dong
31 Oct 2012

Efficiency of Strict Consensus Trees
Mark Wilkinson ... R Olmstead
Systematic Biology | VOL. 50
01 Aug 2001

The asymmetric median tree — A new model for building consensus trees
Cynthia Phillips ... Tandy J Warnow
01 Jan 1996

A comparison of methods for constructing evolutionary networks from intraspecific DNA sequences
Patrick Mardulyn ... Michel C Milinkovitch
01 Jan 2001
