Scalable alignment-free approaches in microbial phylogenomics

Guillaume Bernard

doi:10.14264/uql.2017.933

Abstract

In the 1970s, Carl Woese and colleagues discovered the third domain of life by comparing oligonucleotide catalogs of 16S/18S rRNAs. Four decades later, phylogenetic studies are mostly based on multiple sequence alignment (MSA) approaches. However, genome evolution in microbes involves highly dynamic molecular mechanisms, including genome rearrangement and lateral genetic transfer (LGT). These mechanisms can violate the implicit assumption of full-length contiguity in MSA. Furthermore, commonly used MSA-based approaches can necessitate the use of heuristic methods, e.g. Bayesian inference, in reconstructing phylogenies, and these may not scale to the quantity of existing and forthcoming genome data. In recent years, alignment-free (AF) methods have been developed as an alternative strategy to infer evolutionary relatedness based on shared subsequences of fixed length, known as k-mers, an idea similar to Woese’s early work. In this thesis, I aimed to study the complex evolution of microbial genomes by developing novel AF approaches and systematically assessing the potential of AF methods for phylogenetic inference. This could provide new insight into microbial evolution and change the way we do phylogenomics, i.e. lead to the development of “next-generation phylogenomics”.

The thesis starts with a brief overview of the diversity of microbial life, and the difficulties in understanding microbial evolution due to complex phenomena such as LGT or rearrangement. I explain how phylogenomic approaches can be used to understand microbial evolution, and describe distinct approaches based on MSA and AF. The second chapter is a literature review of the conceptual foundations of alignment-free approaches for the inference of phylogenetic relationships of genome sequences. I discuss the limitations of MSA-based approaches, introduce the concept of k-mers, present in detail the different families of alignment-free approaches and describe their applications to infer vertical and lateral phylogenetic signal among microbial genomes.

The three results chapters are presented in the form of research papers, each with its own introduction, methods, results and discussion. In the first results chapter, I examined the performance of AF approaches in recovering accurate phylogenies of bacterial protein and nucleotide sequences simulated under diverse evolutionary scenarios. I implemented an AF approach to infer phylogenies and compared the robustness of a class of AF methods, the D2 statistics, with that of an MSA-based approach against among-site rate heterogeneity, compositional biases, genetic rearrangements, insertions/deletions, sequence divergence and sequence truncation. I also assessed the scalability of these methods on simulated and empirical data. This work demonstrated that, compared with an MSA approach, AF methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. The AF methods were found to be accurate, scalable and computationally efficient.

In the second results chapter, I systematically assessed the sensitivity and scalability of nine AF methods to genome-scale evolutionary events, including sequence divergence, LGT and rearrangement. The methods selected represent the two families of AF methods: those based on word counts (with exact or inexact k-mers) and those based on match lengths (with or without mismatches). I found that most AF methods are robust against rearrangement and a moderate amount of LGT, and I identified optimal parameters. I also examined the scalability of these methods at genome scale, and found that although all remain fast, their scalability differs between the two families. I also introduced a new application of the jackknife technique to provide node-support values for phylogenies inferred by AF approaches, and showed that these values are biologically meaningful.

In the third results chapter, I implemented an AF approach (based on the [*] statistic) to infer phylogenomic networks for a large dataset of complete genomes of Bacteria and Archaea. I reconstructed a phylogenomic network of microbial life using 2785 completely sequenced bacterial and archaeal genomes, and systematically assessed the impact of ribosomal RNA and plasmid sequences on this network. By implementing and varying a distance threshold, I captured changes in the network structure, e.g. cliques, that reflect the evolutionary dynamics of microbial genomes. I linked the implicated k-mers to annotated genomic regions (and thus to functions) using a database approach, and defined the term core k-mers. These findings indicate that AF phylogenomics is not limited to tree inference, but can also provide new insight into microbial evolution by combining network analysis with a relational k-mer database in a scalable manner.
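
As a rough illustration of the k-mer idea and the word-count family mentioned in the abstract, the sketch below counts overlapping k-mers in two sequences and computes the basic D2 statistic, i.e. the sum over shared k-mers of the product of their counts. It is a minimal sketch and not the thesis implementation: the function names `kmer_counts` and `d2` are hypothetical, and the normalised D2 variants and the conversion of scores into distances used in the thesis are not shown.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in seq."""
    # A sequence shorter than k yields an empty range, hence an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Basic D2 statistic: sum over shared k-mers of the product of counts.

    Hypothetical helper for illustration; higher values mean more shared
    k-mer content. This is a similarity score, not a distance.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute a non-zero term.
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)
```

Calling `d2("ACGTAC", "CGTACG", 2)` compares the two toy sequences on their shared 2-mers. No alignment is computed at any point, which is why reordering blocks of sequence changes such a score only through the k-mers that span the block boundaries.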
