TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees

Uyen Mai, Siavash Mirarab

doi:10.1186/s12864-018-4620-2

Abstract

Background: Sequence data used in reconstructing phylogenetic trees may include various sources of error. Typically errors are detected at the sequence level, but when missed, the erroneous sequences often appear as unexpectedly long branches in the inferred phylogeny.

Results: We propose an automatic method to detect such errors. We build a phylogeny including all the data then detect sequences that artificially inflate the tree diameter. We formulate an optimization problem, called the k-shrink problem, that seeks to find k leaves that could be removed to maximally reduce the tree diameter. We present an algorithm to find the exact solution for this problem in polynomial time. We then use several statistical tests to find outlier species that have an unexpectedly high impact on the tree diameter. These tests can use a single tree or a set of related gene trees and can also adjust to species-specific patterns of branch length. The resulting method is called TreeShrink. We test our method on six phylogenomic biological datasets and an HIV dataset and show that the method successfully detects and removes long branches. TreeShrink removes sequences more conservatively than rogue taxon removal and often reduces gene tree discordance more than rogue taxon removal once the amount of filtering is controlled.

Conclusions: TreeShrink is an effective method for detecting sequences that lead to unrealistically long branch lengths in phylogenetic trees. The tool is publicly available at https://github.com/uym2/TreeShrink.
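To make the k-shrink problem stated in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the polynomial-time algorithm the authors present, and it is not the TreeShrink API. It assumes the tree is given as a list of weighted edges, takes the diameter to be the longest leaf-to-leaf path, and tries every set of k leaves. The function names and the toy tree are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def leaf_distances(edges):
    """Return (leaves, dist), where dist[(a, b)] is the path length between leaves a and b."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    leaves = sorted(n for n, nbrs in adj.items() if len(nbrs) == 1)
    dist = {}
    for src in leaves:
        # Depth-first traversal; in a tree every node is reached by exactly one path.
        stack = [(src, None, 0.0)]
        while stack:
            node, parent, d = stack.pop()
            if node != src and len(adj[node]) == 1:
                dist[(src, node)] = d
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, d + w))
    return leaves, dist


def diameter(leaves, dist):
    """Longest leaf-to-leaf distance among the given leaves (0.0 if fewer than two)."""
    return max((dist[(a, b)] for a, b in combinations(leaves, 2)), default=0.0)


def k_shrink_brute_force(edges, k):
    """Try every set of k leaves and return (removed_leaves, resulting_diameter).

    Exponential in k; an illustration of the problem statement only.
    Removing a leaf does not change distances between the remaining leaves,
    so the pairwise distances are computed once on the full tree.
    """
    leaves, dist = leaf_distances(edges)
    best_removed, best_diam = (), diameter(leaves, dist)
    for removed in combinations(leaves, k):
        kept = [x for x in leaves if x not in removed]
        d = diameter(kept, dist)
        if d < best_diam:
            best_removed, best_diam = removed, d
    return best_removed, best_diam


# Hypothetical toy tree: leaves A, B, C, D; internal nodes x, y; D sits on a long branch.
toy_edges = [("A", "x", 1), ("B", "x", 1), ("x", "y", 1), ("C", "y", 1), ("D", "y", 10)]
print(k_shrink_brute_force(toy_edges, 1))  # (('D',), 3.0)
```

In the toy tree the full diameter is 12 (the path from A or B to D), and removing the single leaf D reduces it to 3. Per the abstract, TreeShrink then applies statistical tests to decide whether such a reduction is unexpectedly large; that step is not sketched here.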

Highlights

Sequence data used in reconstructing phylogenetic trees may include various sources of error
We show that TreeShrink improves the quality of gene trees effectively for phylogenomic datasets and can separate strains of HIV
We start by comparing the three tests currently implemented in TreeShrink

Summary

Introduction

Sequence data used in reconstructing phylogenetic trees may include various sources of error. The number of loci involved and the size of the trees make it impossible to carefully examine every sequence alignment and every gene tree manually. Such manual curation, even if possible, is subject to biases of the curator and poses challenges in reproducibility. Analysts often devise creative (if ad-hoc) methods to find and remove erroneous data. Such data filtering should be treated with care because it may remove useful signal in addition to error [11], and it runs the risk of introducing biases. Beyond filtering based on sequences, detecting problematic species from reconstructed trees is possible.

Journal: BMC Genomics 2018, 19(Suppl 5):272
Publication Date: May 1, 2018
Citations: 243
License type: open-access
