Abstract

Background: In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics.

Results: We adapted our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to take unassembled reads as input; we call this implementation Read-SpaM.

Conclusions: Test runs on simulated reads from semi-artificial and real-world bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.

Highlights

  • In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads

  • The usual workflow is as follows: DNA sequencing produces a large number of reads, these reads are assembled to obtain contigs or complete genomes

  • Alignment-free approaches are based on k-mer statistics [10,11,12,13,14,15,16], but there are approaches based on the length [...]

  • Phylogeny-reconstruction methods such as Maximum Likelihood [4] are applied to these alignments to obtain a phylogenetic tree of the species under study

  • As has been mentioned by various authors, an additional advantage of many alignment-free methods is that they can be applied to assembled genome sequences, and to unassembled reads


Summary

Background

Phylogeny reconstruction is a basic task in biological [...] This procedure is time-consuming and error-prone, and it requires manual input from highly specialized experts. We adapted FSWM to compare unassembled reads to each other or to assembled genomes. We call this implementation Read-SpaM (for Read-based Spaced-Word Matches).

Benchmark Setup

To evaluate Read-SpaM, we used simulated reads for three types of test scenarios: (1) pairs of one real and one semi-artificial genome with known phylogenetic distances, to compare estimated distances to real distances over a large range of distance values; (2) pairs of real genomes from different strains of E. coli; and (3) sets of 17 different bacterial taxa, where we used full genome sequences from 16 taxa and unassembled reads from a 17th taxon.

Since FSWM will find more spaced-word matches per position in regions of high sequence similarity than in regions of lower similarity, the program over-estimates the overall similarity between the sequences, i.e. the estimated distances are too small. To mitigate this bias, one can split the first genome into fragments and compare each fragment individually to the complete second genome. We applied Read-SpaM and FSWM to estimate phylogenetic distances within each data set, and calculated trees from these distance matrices with the Neighbor-Joining [51] implementation from the PHYLIP package [52].
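The fragment-splitting idea described above can be illustrated with a small Python sketch. It is not the FSWM or Read-SpaM implementation: the function names, the binary pattern notation ("1" for match positions, "0" for don't-care positions), and the use of a raw mismatch rate at the don't-care positions as a crude dissimilarity proxy are all assumptions made for illustration. The sketch also omits the filtering of spurious matches that gives FSWM its name, and it does not say how per-fragment values should be combined.

```python
from collections import defaultdict


def spaced_word_matches(seq_a, seq_b, pattern):
    """Yield (i, j) where both sequences agree at every '1' position of pattern.

    Hypothetical helper for illustration; no filtering of random matches is done.
    """
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    # Index all spaced words of seq_b by their characters at the match positions.
    index = defaultdict(list)
    for j in range(len(seq_b) - span + 1):
        index["".join(seq_b[j + k] for k in match_pos)].append(j)
    # Look up every spaced word of seq_a in that index.
    for i in range(len(seq_a) - span + 1):
        key = "".join(seq_a[i + k] for k in match_pos)
        for j in index.get(key, ()):
            yield i, j


def mismatch_counts(seq_a, seq_b, pattern):
    """Return (mismatches, total) counted at the '0' positions of all matches."""
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    mismatches = total = 0
    for i, j in spaced_word_matches(seq_a, seq_b, pattern):
        for k in dont_care:
            total += 1
            mismatches += seq_a[i + k] != seq_b[j + k]
    return mismatches, total


def split_into_fragments(genome, fragment_length):
    """Cut genome into consecutive, non-overlapping fragments."""
    return [genome[s:s + fragment_length]
            for s in range(0, len(genome), fragment_length)]


def per_fragment_mismatch_rates(genome_a, genome_b, pattern, fragment_length):
    """Compare each fragment of genome_a to the complete genome_b.

    Fragments without any spaced-word match are skipped (an assumption).
    """
    rates = []
    for fragment in split_into_fragments(genome_a, fragment_length):
        mismatches, total = mismatch_counts(fragment, genome_b, pattern)
        if total:
            rates.append(mismatches / total)
    return rates
```

For example, `per_fragment_mismatch_rates(genome_a, genome_b, "1101011", 500)` would return one mismatch rate per 500-base fragment of `genome_a`; both the pattern and the fragment length here are arbitrary placeholders, not values taken from the paper.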

