The Effect of Sample Bias and Experimental Artefacts on the Statistical Phylogenetic Analysis of Picornaviruses.

Yulia Vakulenko, Andrei Deviatkin, Alexander Lukashev

doi:10.3390/v11111032

Abstract

Statistical phylogenetic methods are a powerful tool for inferring the evolutionary history of viruses through time and space. The selection of mathematical models and analysis parameters has a major impact on the outcome, and has been relatively well-described in the literature. The preparation of a sequence dataset is less formalized, but its impact can be even more profound. This article used simulated datasets of enterovirus sequences to evaluate the effect of sample bias on picornavirus phylogenetic studies. Possible approaches to the reduction of large datasets and their potential for introducing additional artefacts were demonstrated. The most consistent results were obtained using “smart sampling”, which reduced sequence subsets from large studies more than those from smaller ones in order to preserve the rare sequences in a dataset. The effect of sequences with technical or annotation errors in the Bayesian framework was also analyzed. Sequences with about 0.5% sequencing errors or incorrect isolation dates altered by just 5 years could be detected by various approaches, but the efficiency of identification depended upon sequence position in a phylogenetic tree. Even a single erroneous sequence could profoundly destabilize the whole analysis by increasing the variance of the inferred evolutionary parameters.
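The idea behind "smart sampling" can be illustrated with a minimal sketch. It assumes that sequences are grouped by the study they came from and that each group is capped at a fixed size, so large studies are cut heavily while small ones (and single rare sequences) are kept whole. The function name `smart_sample`, the `max_per_group` cap, and the toy identifiers are hypothetical; the exact procedure used in the article is not described in this excerpt.

```python
import random


def smart_sample(groups, max_per_group, seed=0):
    """Cap every group at max_per_group sequences (hypothetical rule).

    groups: dict mapping a study label to a list of sequence identifiers.
    Groups at or below the cap are kept whole, so rare sequences survive;
    larger groups are randomly reduced to the cap.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    kept = {}
    for name, seqs in groups.items():
        if len(seqs) <= max_per_group:
            kept[name] = list(seqs)
        else:
            kept[name] = rng.sample(list(seqs), max_per_group)
    return kept


# Toy dataset: one heavily sampled study, one moderate study, one rare isolate.
groups = {
    "large_study": [f"L{i}" for i in range(200)],
    "medium_study": [f"M{i}" for i in range(12)],
    "rare_isolate": ["R0"],
}
subset = smart_sample(groups, max_per_group=10)
sizes = {name: len(seqs) for name, seqs in subset.items()}
print(sizes)
```

Uniform random downsampling to the same total size would instead be expected to keep sequences in proportion to study size, which is how a single rare sequence can be lost.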

Highlights

The introduction of statistical phylogenetic methods over a decade ago allowed the timing of evolutionary events that occurred in the past to be elucidated by applying complex evolutionary and epidemiological models to contemporary sequences [1].
Other genome regions should not be used for statistical phylogenetics, unless the absence of recombination has been proven by analysis with a set of algorithms and by demonstrating congruence between phylogenies for the VP1 gene and a selected genome region.
In the case of statistical phylogenetics, even much less severe sequencing errors can have a profound effect on the whole analysis; a quick recombination screening would improve the reliability of any study.

Summary

Introduction

The introduction of statistical phylogenetic methods (termed Bayesian phylogenetics) over a decade ago allowed the timing of past evolutionary events to be elucidated by applying complex evolutionary and epidemiological models to contemporary sequences [1]. This novel approach was especially well-suited to RNA viruses, which acquire nucleotide substitutions at high rates, usually on the order of 10⁻² to 10⁻⁵ substitutions/site/year (s/s/y) [2].
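To give a feel for what these rates mean in practice, the sketch below converts a rate in s/s/y into an expected number of substitutions for a sequence of a given length. The 900-nucleotide alignment length is a hypothetical value chosen only for illustration, and the function name `expected_substitutions` is likewise an assumption of this example.

```python
def expected_substitutions(rate, length_nt, years):
    """Expected substitution count = rate (s/s/y) * number of sites * years."""
    return rate * length_nt * years


ALIGNMENT_LENGTH = 900  # hypothetical alignment length in nucleotides

# Walk through the quoted range of rates, one order of magnitude at a time.
for rate in (1e-2, 1e-3, 1e-4, 1e-5):
    per_year = expected_substitutions(rate, ALIGNMENT_LENGTH, 1)
    print(f"{rate:.0e} s/s/y -> {per_year:.3f} expected substitutions per year")
```

Under these assumptions, the upper end of the range corresponds to about 9 expected substitutions per year across the alignment, and the lower end to about 0.009.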

Methods

Results

Conclusion