Abstract

Word-based or ‘alignment-free’ sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through GitHub: https://github.com/jschellh/ProtSpaM.

Highlights

  • Evolutionary relationships between species are usually inferred by comparing homologous gene or protein sequences to each other

  • Our approach is based on filtered spaced word matches (FSWM), a concept we introduced recently for whole-genome sequence comparison [33]; see [31, 35] for related approaches

  • We compared our program to the following four other alignment-free methods that can be run on protein sequences: Average Common Substring Approach (ACS) [21], Feature Frequency Profile (FFP) [36, 8], kmacs [22], and Composition Vector Tree (CVTree) [11]


Summary

Introduction

Evolutionary relationships between species are usually inferred by comparing homologous gene or protein sequences to each other. Groups of orthologous sequences have to be identified first, and a multiple alignment has to be calculated for each group. There are generally two different strategies for resolving phylogenies based on multiple alignments. In the so-called supermatrix approach, multiple sequence alignments of single genes or proteins are concatenated, and a phylogenetic tree is inferred from the resulting matrix, e.g., using maximum likelihood [1] or Bayesian inference [2]. Alternatively, gene or protein trees are inferred for every single multiple sequence alignment, and a species phylogeny is then derived from these trees using coalescent models [3] or supertree [4] approaches.
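The concatenation step of the supermatrix approach can be illustrated with a minimal sketch. It assumes each per-gene alignment is given as a mapping from taxon name to aligned sequence, and that a taxon missing from a gene is padded with gap characters; the function name `build_supermatrix` and the toy sequences are hypothetical and are not taken from any of the cited tools.

```python
from typing import Dict, List


def build_supermatrix(alignments: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Concatenate per-gene alignments into one row per taxon.

    Assumption for this sketch: a taxon absent from a gene alignment
    is filled with gap characters of that alignment's width.
    """
    # Collect every taxon that occurs in at least one alignment.
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    rows: Dict[str, List[str]] = {taxon: [] for taxon in taxa}

    for aln in alignments:
        # All rows of one alignment must have the same number of columns.
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all rows of an alignment must have the same length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, gap * width))

    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Toy input: two aligned protein fragments; "mouse" is missing from gene_b.
gene_a = {"human": "MKT-A", "mouse": "MKTGA", "yeast": "MRT-A"}
gene_b = {"human": "LLPQ", "yeast": "LIPQ"}

supermatrix = build_supermatrix([gene_a, gene_b])
for taxon, row in supermatrix.items():
    print(f"{taxon:6s} {row}")
```

The resulting matrix would then be passed to a tree-inference program. In the second strategy, each alignment would instead be analysed separately and the resulting trees combined afterwards.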

