Efficient Detection of Repeating Sites to Accelerate Phylogenetic Likelihood Calculations.

K Kobert, A Stamatakis, T Flouri

doi:10.1093/sysbio/syw075

Abstract

The phylogenetic likelihood function (PLF) is the major computational bottleneck in several applications of evolutionary biology such as phylogenetic inference, species delimitation, model selection, and divergence times estimation. Given the alignment, a tree and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted for improving run-time and, using appropriate data structures, reducing memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory savings attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 12-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the PLF currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation.

Highlights

Apart from the aforementioned standard techniques, there are several studies on improving the run-time of the phylogenetic likelihood function (PLF). Sumner and Charleston (2010) presented a method that relies on partial likelihood tensors.
This means that the proposed column sorting may not yield the maximum amount of savings. Larget and Simon (1998) propose another algorithm that considers site repeats. At every node their method builds one bit-mask for each site in the alignment. Because this process constructs and manipulates large bit-vectors at every node and sorts them to find identical entries, it incurs a high computational overhead. Valle et al. (2014) present another method that focuses on positive selection analysis and deploys a variation of site repeats to accelerate the PLF.
The first variant (SRDT) assumes no prior knowledge of the site repeats of a tree topology and computes them before each PLF call. This variant is required for tree space exploration, as site repeats change every time the tree topology is modified.

Summary

Introduction

Apart from the aforementioned standard techniques, there are several studies on improving the run-time of the PLF. Sumner and Charleston (2010) presented a method that relies on partial likelihood tensors. We describe a simple algorithm that satisfies the efficiency properties described above: it detects identical sites at any node of the phylogenetic tree and at the (selected) root, and minimizes the number of operations required for likelihood evaluation. We will use the basic algorithm REPEATS to gradually build the complete method, which performs a post-order traversal over all nodes of tree T and incorporates the memory saving technique to reduce memory requirements.
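The idea of detecting identical sites during a post-order traversal can be illustrated with a minimal sketch. It is not the authors' REPEATS implementation: the function name `site_classes`, the nested-tuple tree encoding, and the use of a Python dictionary to number distinct child-identifier pairs are all assumptions made for illustration. At a tip, two sites fall into the same class if they show the same character; at an inner node, two sites fall into the same class if their class identifiers agree at both children. Sites in the same class at a node would yield identical conditional likelihood entries there, so only one entry per class would need to be computed.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical tree encoding: a tip is a taxon name, an inner node is a
# (left, right) pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def site_classes(tree: Tree, alignment: Dict[str, str]) -> Tuple[List[int], int]:
    """Return (class identifier per site, number of classes) for a subtree.

    The recursion visits both children before the node itself, which is a
    post-order traversal.
    """
    if isinstance(tree, str):
        # Tip: the key of a site is simply its character.
        keys = list(alignment[tree])
    else:
        # Inner node: the key of a site is the pair of child identifiers.
        left, _ = site_classes(tree[0], alignment)
        right, _ = site_classes(tree[1], alignment)
        keys = list(zip(left, right))

    ids: Dict[object, int] = {}  # assumption: a dict stands in for any lookup table
    classes: List[int] = []
    for key in keys:
        if key not in ids:
            ids[key] = len(ids)  # first occurrence gets a new identifier
        classes.append(ids[key])
    return classes, len(ids)


# Toy alignment with five sites; sites 0 and 3 repeat, as do sites 1 and 4.
alignment = {"A": "ACAAC", "B": "AGAAG", "C": "TTATT"}
tree = (("A", "B"), "C")

inner_classes, inner_count = site_classes(("A", "B"), alignment)
root_classes, root_count = site_classes(tree, alignment)
```

In this toy example the inner node joining A and B has two site classes and the root has three, so fewer than five entries per node would need to be evaluated. A practical implementation would also cache the per-node identifiers rather than recompute them in every recursive call.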

Results

Conclusion