Evolutionary Distances in the Twilight Zone—A Rational Kernel Approach

Roland F Schwarz, William Fletcher, Florian Markowetz, Matthias Wolf, Benjamin Merget, Jörg Schultz, Frank Förster, Wayne Delport

doi:10.1371/journal.pone.0015788

Open Access

Abstract

Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.

Highlights

State-of-the-art phylogenetic reconstruction methods are currently being challenged.
We make use of two major observations: (i) the classical problem of pairwise alignment can be posed as a shortest-path problem on a log-weighted finite-state transducer (FST) [25]; and (ii) FSTs that can be decomposed into another FST and its inverse give rise to positive definite (pd) rational kernels [26].
The methods studied were (i) traditional multiple alignment using Muscle [32] followed by Jukes-Cantor distance estimation using Phylip [33], (ii) statistical consistency alignment using ProbCons based on pair-Hidden Markov Models (HMMs) [34] followed by RAxML maximum-likelihood tree reconstruction [35], (iii) an alignment-free method of distance estimation based on the Lempel-Ziv complexity [12], (iv) a pattern-based maximum-likelihood approach for alignment-free distance estimation [36] and (v) the classical Levenshtein distance [28].
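
The shortest-path view of pairwise alignment mentioned above can be illustrated with a minimal Python sketch. It is not the authors' transducer or their parametrization: the function name `edit_cost_shortest_path` and the unit substitution and indel costs are assumptions made for illustration, which reduce the problem to the classical Levenshtein distance. Each node (i, j) of the alignment graph records how many symbols of each sequence have been consumed, each arc plays the role of a transducer transition (match or substitution, deletion, insertion), and the alignment cost is the weight of the cheapest path from (0, 0) to the end of both sequences.

```python
import heapq


def edit_cost_shortest_path(x, y, sub=1.0, indel=1.0):
    """Cheapest path through the alignment graph of x and y (Dijkstra).

    Node (i, j): i symbols of x and j symbols of y have been consumed.
    Arc weights are illustrative unit costs, not the paper's weights.
    """
    start, goal = (0, 0), (len(x), len(y))
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        arcs = []
        if i < len(x) and j < len(y):
            # match (cost 0) or substitution: consume one symbol of each
            arcs.append(((i + 1, j + 1), 0.0 if x[i] == y[j] else sub))
        if i < len(x):
            arcs.append(((i + 1, j), indel))  # deletion: consume x only
        if j < len(y):
            arcs.append(((i, j + 1), indel))  # insertion: consume y only
        for node, w in arcs:
            nd = d + w
            if nd < dist.get(node, float("inf")):
                dist[node] = nd
                heapq.heappush(heap, (nd, node))
    return dist[goal]


if __name__ == "__main__":
    print(edit_cost_shortest_path("kitten", "sitting"))
```

Replacing the unit costs with negative log-probabilities from a substitution and indel model would turn the same search into a most-probable-alignment computation; summing over all paths instead of taking the minimum is the step that leads from a single alignment score towards a kernel-style similarity.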

Summary

Introduction

State-of-the-art phylogenetic reconstruction methods are currently being challenged. Multiple sequence alignments followed by maximum-likelihood (ML) tree reconstruction have been seen as the computationally expensive gold standard for phylogenetic analyses [1,2]. Yet the expected sequence length required for the reconstructed tree to converge to the true phylogeny is no worse for distance-based approaches than for ML [4]. The quality of the multiple sequence alignment heavily affects reconstruction accuracy, a situation worsened by the NP-hardness of the alignment problem and the heuristics used to cope with it [5,6,7,8,9]. Alignment errors arise especially in large-scale phylogenies with many taxa that span a broad divergence range [10], where many homologies lie in the twilight zone of sequence alignments [11].

Methods

Results

Conclusion