Abstract
Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an “online” fashion. Widely used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package, and relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data—in terms of alignment changes, sequence addition or removal—present common scenarios that can benefit from online inference.
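The distance-based insertion step mentioned above can be illustrated with a minimal sketch. This is not the BEAST 1.10 implementation: the distance measure (proportion of differing aligned sites), the nested-tuple tree representation, and the helper names `hamming_fraction`, `nearest_taxon`, and `insert_as_sibling` are all assumptions made for illustration. Branch lengths and the imputation of new model parameters are omitted.

```python
# Illustrative sketch only: names, tree encoding, and distance choice are
# hypothetical and are not taken from BEAST.

def hamming_fraction(a, b):
    """Proportion of aligned sites that differ, skipping gaps and ambiguity codes."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return 1.0  # no comparable sites: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_taxon(new_seq, alignment):
    """Name of the existing taxon whose sequence is closest to new_seq."""
    return min(alignment, key=lambda name: hamming_fraction(new_seq, alignment[name]))


def insert_as_sibling(tree, target, new_name):
    """Return a copy of the tree with new_name attached as a sibling of target.

    The tree is a nested tuple whose leaves are taxon names (strings).
    """
    if isinstance(tree, str):
        return (tree, new_name) if tree == target else tree
    return tuple(insert_as_sibling(child, target, new_name) for child in tree)


# Toy alignment and current tree estimate (made-up data).
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTT",
    "C": "TTGTACGAAC",
}
tree = (("A", "B"), "C")

# A newly sequenced taxon arrives and is placed next to its closest relative.
new_name, new_seq = "D", "TTGTACGAAT"
closest = nearest_taxon(new_seq, alignment)
updated_tree = insert_as_sibling(tree, closest, new_name)
print(closest, updated_tree)
```

In this toy case the new sequence differs from taxon C at a single site, so it is attached as C's sibling. The resulting tree would then serve only as an informed starting state for further posterior sampling, not as a final estimate.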
Highlights
Data often change during ongoing research in many fields, including phylogenetics
We evaluate the performance of our BEAST 1.10 online inference framework by analyzing complete genome data from the West African Ebola virus epidemic of 2013–2016
We present a framework for online Bayesian phylodynamic inference that accommodates a continuous data flow, as exemplified by an epidemic scenario where continued sampling efforts yield a series of genome sequences over time
Summary
Data often change during ongoing research in many fields, including phylogenetics. These changes typically include the addition of new sequences as they become available (for example, during a large sequencing study or through data sharing) and updates to alignments of existing sequences, possibly as a result of correcting sequencing errors. Such changes usually lead researchers to discard results obtained before the data were revised and to restart statistical analyses completely from scratch (de novo). A promising avenue to mitigate this problem is an online phylogenetic inference framework that can accommodate data changes in existing analyses and leverage intermediate results to shorten the run times of updated inferences.