Abstract

Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. Whole-genome, alignment-free methods offer an alternative to traditional, alignment-based approaches. These methods are scalable and require minimal manual intervention. We developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact substring matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low-complexity sequences, and for branch-length nonlinearity caused by multiple mutations at the same site. We tested SlopeTree on 495 bacteria, 73 archaea, and 72 strains of Escherichia coli and Shigella. We compared our trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. We assessed differences in tree topology across methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than sets of core genes, we observed some grouping by phenotype rather than phylogeny; for instance, a cluster of sulfur-reducing thermophilic bacteria came together irrespective of their phyla. The source code for SlopeTree is available at: http://prodata.swmed.edu/download/pub/slopetree_v1/slopetree.tar.gz.
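The idea of measuring how exact matches decay with match length can be illustrated with a minimal sketch. This is not the SlopeTree implementation: the function names, the use of a plain least-squares fit on log counts, and the toy sequences are all assumptions made for illustration. The sketch counts exact k-mer matches between two sequences for several values of k and fits a line to the logarithm of the counts; more diverged sequences would be expected to lose long matches faster and so show a steeper (more negative) slope.

```python
import math
from collections import Counter


def shared_kmer_count(a, b, k):
    """Count exact k-mer matches between a and b (all pairs of matching positions)."""
    kmers_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kmers_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(n * kmers_b[kmer] for kmer, n in kmers_a.items() if kmer in kmers_b)


def decay_slope(a, b, k_values):
    """Least-squares slope of ln(match count) against k (hypothetical helper).

    Values of k with zero matches are skipped because ln(0) is undefined.
    """
    points = []
    for k in k_values:
        count = shared_kmer_count(a, b, k)
        if count > 0:
            points.append((k, math.log(count)))
    if len(points) < 2:
        raise ValueError("need at least two k values with matches")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


if __name__ == "__main__":
    reference = "ACDEFGHIKLMNPQRSTVWY"
    diverged = "ACDEFGHIKAMNPQRSTVWY"  # one substitution relative to reference
    print(decay_slope(reference, reference, [1, 2, 3]))
    print(decay_slope(reference, diverged, [1, 2, 3]))
```

In this toy setting a single substitution removes every k-mer that overlaps the changed position, so the match count falls off faster with k than it does for identical sequences.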

Highlights

  • Learning how to obtain complete genomes was a critical step to understanding biology and was achieved as early as 1977 for the genome of bacteriophage φX174 [1]

  • Due to their lack of distinct morphological features, bacteria and archaea were extremely difficult to classify until technology was developed to obtain their DNA sequences; these sequences could be compared to estimate evolutionary relationships

  • In the recent work of Lang and Eisen [25], an analysis of ~900 diverse prokaryotes from both bacteria and archaea identified only 24 suitable genes. These consisted of a subset of ribosomal proteins, two translation factors that both interact with the ribosome, and the alpha subunit of a phenylalanyl-tRNA synthetase, which was the only protein in the set not interacting with the ribosome and which contributed only ~5% of the overall alignment used to generate the phylogeny


Summary

Introduction

Learning how to obtain complete genomes was a critical step to understanding biology and was achieved as early as 1977 for the genome of bacteriophage φX174 [1]. In the recent work of Lang and Eisen [25], an analysis of ~900 diverse prokaryotes from both bacteria and archaea identified only 24 suitable (i.e. paralog-free) genes. These consisted of a subset of ribosomal proteins, two translation factors that both interact with the ribosome, and the alpha subunit of a phenylalanyl-tRNA synthetase, which was the only protein in the set not interacting with the ribosome and which contributed only ~5% of the overall alignment used to generate the phylogeny.

The sorted, merged k-mer list derived from the scrambled proteins is passed through the SlopeTree match-counting algorithm (Algorithm 3), generating its own set of histograms in which the evolutionarily conserved sequences have been completely erased. SlopeTree’s background correction consists of subtracting the counts in the histograms obtained from randomized sequences from the counts in the histograms obtained from real data (Fig 1A and 1B).
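The background subtraction described above can be sketched roughly as follows. This is an illustrative approximation, not SlopeTree's Algorithm 3: the function names are hypothetical, scrambling is assumed to mean shuffling the residues within each protein, the histogram is assumed to hold one match count per k, and clamping negative differences to zero is an added assumption.

```python
import random
from collections import Counter


def match_histogram(proteins_a, proteins_b, k_values):
    """Map each k to the number of exact k-mer matches between two proteomes."""
    hist = {}
    for k in k_values:
        kmers_a = Counter(p[i:i + k] for p in proteins_a for i in range(len(p) - k + 1))
        kmers_b = Counter(p[i:i + k] for p in proteins_b for i in range(len(p) - k + 1))
        hist[k] = sum(n * kmers_b[m] for m, n in kmers_a.items() if m in kmers_b)
    return hist


def scramble(proteins, rng):
    """Shuffle residues within each protein, keeping length and composition."""
    scrambled = []
    for p in proteins:
        residues = list(p)
        rng.shuffle(residues)
        scrambled.append("".join(residues))
    return scrambled


def background_corrected(proteins_a, proteins_b, k_values, seed=0):
    """Subtract scrambled-sequence match counts from real match counts.

    Negative differences are clamped to zero (an assumption of this sketch).
    """
    rng = random.Random(seed)
    real = match_histogram(proteins_a, proteins_b, k_values)
    noise = match_histogram(scramble(proteins_a, rng), scramble(proteins_b, rng), k_values)
    return {k: max(real[k] - noise[k], 0) for k in k_values}


if __name__ == "__main__":
    genome_a = ["MKVLAAGIVGLLLAQ", "MSTNPKPQRKTKRNT"]
    genome_b = ["MKVLAAGIVALLLAQ", "MSTNPKPQRKTKRNT"]
    print(background_corrected(genome_a, genome_b, range(1, 6)))
```

Because shuffling preserves amino acid composition, single-residue matches (k = 1) cancel exactly in this sketch, while longer matches that reflect conserved sequence are largely retained.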
