Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences.

Chris-Andre Leimeister,Burkhard Morgenstern,Christoph Bleidorn,Jendrik Schellhorn,Svenja Dörrer,Michael Gerth

doi:10.1093/gigascience/giy148

Open Access

Journal: GigaScience	Publication Date: Dec 7, 2018
Citations: 22	License type: CC BY 4.0

Affiliation: University of Göttingen, University of Liverpool

Abstract

Word-based or ‘alignment-free’ sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through GitHub: https://github.com/jschellh/ProtSpaM.

Highlights

Evolutionary relationships between species are usually inferred by comparing homologous gene or protein sequences to each other
Our approach is based on filtered spaced word matches (FSWM), a concept we introduced recently for whole-genome sequence comparison [33]; see [31, 35] for related approaches
We compared our program to the following four other alignment-free methods that can be run on protein sequences: Average Common Substring Approach (ACS) [21], Feature Frequency Profile (FFP) [36, 8], kmacs [22], and Composition Vector Tree (CVTree) [11]

Summary

Introduction

Evolutionary relationships between species are usually inferred by comparing homologous gene or protein sequences to each other. Groups of orthologous sequences have to be identified first, and multiple alignments then have to be calculated for them. There are generally two different strategies for resolving phylogenies based on multiple alignments. In the so-called supermatrix approach, multiple sequence alignments of single genes or proteins are concatenated, and a phylogenetic tree is inferred from the resulting matrix, e.g., using maximum likelihood [1] or Bayesian inference [2]. Alternatively, gene or protein trees are inferred for every single multiple sequence alignment, and the resulting phylogeny is then inferred from these trees using coalescent models [3] or supertree [4] approaches.
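
The concatenation step of the supermatrix approach can be illustrated with a minimal sketch. It is not taken from Prot-SpaM or from any particular phylogenetics package: the function name build_supermatrix, the toy alignments, and the choice to pad taxa that are missing from a gene with gap characters are all assumptions made for illustration. Each per-gene alignment is represented as a mapping from taxon name to aligned sequence.

```python
from typing import Dict, List


def build_supermatrix(alignments: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Concatenate per-gene alignments into one supermatrix row per taxon.

    Hypothetical helper for illustration only. Taxa that are absent from a
    gene alignment are padded with gap characters of that alignment's width.
    """
    # Collect every taxon that occurs in at least one alignment.
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    rows: Dict[str, List[str]] = {taxon: [] for taxon in taxa}

    for aln in alignments:
        # All rows of one alignment must share the same number of columns.
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, gap * width))

    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Toy alignments (made-up sequences); gene_b lacks taxon "sp2".
gene_a = {"sp1": "MKV-L", "sp2": "MKVAL", "sp3": "MRV-L"}
gene_b = {"sp1": "GATW", "sp3": "GSTW"}

supermatrix = build_supermatrix([gene_a, gene_b])
for taxon, row in supermatrix.items():
    print(taxon, row)
```

In practice, the resulting matrix would be passed to a maximum-likelihood or Bayesian tree inference program; that step is outside the scope of this sketch.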

Methods

Results

Conclusion