Fast and accurate phylogeny reconstruction using filtered spaced-word matches

Chris-André Leimeister, Salma Sohrabi-Jahromi, Burkhard Morgenstern, Alfonso Valencia

doi:10.1093/bioinformatics/btw776


Open Access



Journal: Bioinformatics
Publication Date: Jan 4, 2017
Citations: 57
License type: CC BY-NC 4.0

Affiliation: University of Göttingen

Abstract

Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.

Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.

Availability and Implementation: The program source code for FSWM, including documentation, and the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/

Supplementary information: Supplementary data are available at Bioinformatics online.
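The procedure described in the abstract can be illustrated with a small Python sketch. It is not the FSWM implementation: the function names, the +1/-1 scoring of don't-care positions, the default threshold of 0, and the Jukes-Cantor correction used to turn the mismatch fraction into substitutions per site are all assumptions made for illustration. The pattern is written as a string of `1` (match position) and `0` (don't-care position).

```python
import math
from collections import defaultdict


def spaced_word_matches(seq_a, seq_b, pattern):
    """Yield (i, j) where seq_a[i:] and seq_b[j:] agree at every '1' of pattern."""
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    # Index every window of seq_b by the characters at its match positions.
    index = defaultdict(list)
    for j in range(len(seq_b) - span + 1):
        index["".join(seq_b[j + k] for k in match_pos)].append(j)
    # Look up each window of seq_a in that index.
    for i in range(len(seq_a) - span + 1):
        key = "".join(seq_a[i + k] for k in match_pos)
        for j in index.get(key, ()):
            yield i, j


def estimate_distance(seq_a, seq_b, pattern, threshold=0,
                      match_score=1, mismatch_score=-1):
    """Estimate substitutions per site from filtered spaced-word matches.

    Assumptions (not taken from the abstract): similarity is a simple
    match/mismatch score summed over the don't-care positions, and the
    mismatch fraction is corrected with the Jukes-Cantor formula.
    Returns None if no spaced-word match passes the filter.
    """
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    mismatches = 0
    total = 0
    for i, j in spaced_word_matches(seq_a, seq_b, pattern):
        pairs = [(seq_a[i + k], seq_b[j + k]) for k in dont_care]
        score = sum(match_score if x == y else mismatch_score for x, y in pairs)
        if score < threshold:
            continue  # filter: likely a spurious random match
        mismatches += sum(x != y for x, y in pairs)
        total += len(pairs)
    if total == 0:
        return None
    p = mismatches / total
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # Jukes-Cantor correction is undefined here
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    # Toy sequences; real inputs would be whole genomes.
    print(estimate_distance("ACGTACGATTGCA", "ACGAACGATAGCA", "1100101"))
```

The dictionary index keeps the match search close to linear in the sequence lengths for this toy setting; a tool aimed at genomes of a few hundred Mb would need a more memory-efficient structure and a longer pattern.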

Full Text