Towards Making Sense of "The Tree of Life"

Stephane Guindon

doi:10.15200/winn.140096.68385

Abstract

I started working on PhyML during my second year as a PhD student. The article describing the first part of my PhD thesis had just been published, and I felt it was the right time to take some risk and try something that at first seemed out of my depth: implementing a program that calculates the phylogenetic likelihood function. In 2002, very few programs were based on the likelihood principle. Calculating this function looked like a tough challenge, but the underlying algorithm (Felsenstein's pruning algorithm (Felsenstein 1981)) is beautiful, and I was eager to test my programming skills on that nice problem. I was based in Montpellier, in the south of France, at that time, but my wife lived in Paris, which meant I was spending a lot of time away from the lab. This freedom gave me the opportunity to immerse myself completely in my task. I remember being in Paris, not far from the Sacré-Cœur, crunching numbers and, for the first time, having my own program return the very same likelihood value as that produced by PAML (Yang 2007) and PHYLIP (Felsenstein 2005), the reference programs in likelihood-based phylogenetics. This felt like a very significant victory to me. I was hooked. I thus continued programming and tried to accommodate larger data sets and apply more sophisticated parameter estimation techniques. It quickly became apparent, though, that conventional algorithms would not allow me to analyze data sets with more than ~10 sequences. Other methods that do not rely on the likelihood framework could easily go up to ~100 sequences but lacked accuracy. A significant speed-up in likelihood-based phylogenetic analyses was therefore sorely needed. The core of my program relied on functions that would modify the current solution one step at a time, with each step applying the same operation to a new part of the phylogenetic tree.
In order to save computing time, I decided to slightly modify that core and apply these multiple local operations all at the same time. Surprisingly, the results turned out to be very encouraging: the new algorithm was not only as accurate as the other likelihood-based programs, it was also an order of magnitude faster. I remember then proudly showing the first results to Olivier Gascuel, my PhD supervisor. He was quite enthusiastic too and suggested further optimization strategies that significantly improved the algorithm. PhyML was born. Olivier then wrote most of the paper while I was running extensive simulations that would compare the

Full Text