Abstract

The study of the evolutionary interrelations of living organisms has always been at the heart of the biological sciences. A revolution in sequencing techniques over the past decades has caused a massive increase in molecular sequence data. As a result, contemporary methods assess evolutionary relationships between organisms by quantifying the degree of similarity between their biological sequences. The relationships discovered in phylogenetic studies are commonly represented and visualized as phylogenetic trees or networks. Traditionally, sequences have been extracted from single organisms; however, recent technological progress has enabled the retrieval of sequence data directly from environmental samples. This yields large numbers of short sequencing reads that may originate from any of the organisms present in the respective environment. One major subsequent objective is the taxonomic or phylogenetic identification of those sequencing reads. However, longstanding maximum-likelihood-based de-novo phylogeny reconstruction methods are limited in their applicability by their computational demands; typically, they cannot be applied when the available molecular sequences are very numerous or very long. Fortunately, phylogenetic placement offers a way to identify large sets of query reads within their phylogenetic context by inserting them into an existing phylogenetic tree comprising a set of reference sequences. Here, we present a new alignment- and assembly-free approach to phylogenetic placement, the Alignment-free phylogenetic placement algorithm based on Spaced-word Matches (App-SpaM). App-SpaM extracts short, non-contiguous subwords to detect homologies between the query and reference sequences, a method known as the spaced-word matches approach. It counts the number of such words and utilizes them to infer the average number of nucleotide substitutions between each read and each reference sequence. Then, it uses fast heuristics to infer a suitable placement position within the reference tree. In a comprehensive evaluation, we assessed how App-SpaM compares to existing algorithms for phylogenetic placement with respect to accuracy and computation speed. We demonstrate that App-SpaM is on par with maximum-likelihood-based algorithms on metataxonomic data sets. In addition, App-SpaM is two to three orders of magnitude faster than the next fastest programs while its memory demands stay low. We extensively discuss App-SpaM’s advantages and drawbacks and propose several additional features to improve upon its original version: we evaluate a set of novel placement heuristics, the use of sampling techniques to improve scalability with the length of the reference sequences, and a measure for the uncertainty of proposed placement positions. Subsequently, we present a variety of novel use cases of phylogenetic placement that are made uniquely possible by App-SpaM’s versatility with respect to its potential input data. These applications include, in particular, the iterative augmentation of existing species trees by means of phylogenetic placement and the screening for outlier genes or species prior to phylogeny reconstruction.
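The spaced-word idea described in the abstract can be illustrated with a minimal Python sketch. It is not App-SpaM's implementation, and it rests on several assumptions made purely for illustration: the pattern `1101001` is hypothetical, the mismatch rate is estimated at the don't-care positions of matching spaced words, the Jukes-Cantor correction converts that rate into substitutions per site, and the "placement" simply picks the closest reference sequence. The sketch also omits the filtering of chance matches that a practical tool would need.

```python
import math
from collections import defaultdict

# Hypothetical pattern: '1' = position must match, '0' = don't-care position.
PATTERN = "1101001"


def spaced_words(seq, pattern=PATTERN):
    """Yield (match_key, dont_care_chars) for every window of seq."""
    match_idx = [i for i, c in enumerate(pattern) if c == "1"]
    care_idx = [i for i, c in enumerate(pattern) if c == "0"]
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        yield ("".join(window[i] for i in match_idx),
               "".join(window[i] for i in care_idx))


def mismatch_rate(query, reference, pattern=PATTERN):
    """Fraction of differing don't-care positions over all spaced-word matches.

    Returns None when the two sequences share no spaced word.
    """
    index = defaultdict(list)
    for key, dont_care in spaced_words(reference, pattern):
        index[key].append(dont_care)
    compared = mismatches = 0
    for key, dc_query in spaced_words(query, pattern):
        for dc_ref in index.get(key, ()):
            compared += len(dc_query)
            mismatches += sum(a != b for a, b in zip(dc_query, dc_ref))
    return mismatches / compared if compared else None


def jukes_cantor(p):
    """Jukes-Cantor distance for mismatch rate p (infinite if undefined)."""
    if p is None or p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def closest_reference(query, references):
    """Simplest conceivable heuristic: pick the reference with minimal distance."""
    distances = {name: jukes_cantor(mismatch_rate(query, seq))
                 for name, seq in references.items()}
    return min(distances, key=distances.get), distances


if __name__ == "__main__":
    refs = {"ref_A": "ACGTACG", "ref_B": "ACTTAGG"}  # toy sequences
    best, dists = closest_reference("ACGTACG", refs)
    print(best, dists)
```

Because only the `1` positions must agree, a spaced word can still match across a substitution at a don't-care position, which is what makes the mismatch rate at those positions usable as a distance signal without computing an alignment.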
