Navigating Sampling Bias in Discrete Phylogeographic Analysis: Assessing the Performance of an Adjusted Bayes Factor

Abstract

Bayesian phylogeographic inference is widely used in molecular epidemiological studies to reconstruct the dispersal history of pathogens. Discrete phylogeographic analysis treats geographic locations as discrete traits and infers lineage transition events among them; it is typically followed by a Bayes factor (BF) test to assess the statistical support for those events. The standard BF (BFstd) test does not consider the relative abundance of the trait states involved, which can be problematic when sampling is unbalanced. Existing methods that correct for sampling bias in discrete phylogeographic analyses using the continuous-time Markov chain (CTMC) model often require additional epidemiological information to balance the sampling effort among locations. As such data are not necessarily available, alternative approaches that rely solely on the available genomic data are needed. To this end, we assess the performance of a modification of the BFstd, the adjusted Bayes factor (BFadj), which incorporates information on the relative abundance of samples by location when inferring support for transition events and for the root location, without requiring additional data. Using a simulation framework, we assess the statistical performance of BFstd and BFadj under varying levels of sampling bias, estimating their type I and type II error rates. Our results show that BFadj complements BFstd by reducing type I errors at the cost of increasing type II errors for inferred transition events, while improving both type I and type II errors in root location inference. Our findings provide guidelines for implementing the complementary BFadj to detect and mitigate sampling bias in discrete phylogeographic inference using CTMC modeling.
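
As a rough illustration of the quantities named in the abstract, the sketch below computes a Bayes factor as posterior odds divided by reference odds, and then estimates type I and type II error rates against a known (simulated) set of true transition events. The abstract does not give the formula for BFadj, so the code assumes, purely for illustration, that the adjustment amounts to replacing the prior probability with a reference probability that reflects the relative abundance of samples by location. The function names, the event labels, and the support threshold of 3 are likewise hypothetical choices, not taken from the paper.

```python
import math


def bayes_factor(posterior_prob: float, reference_prob: float) -> float:
    """Posterior odds divided by reference odds.

    For a standard BF, reference_prob is the prior probability of the event.
    ASSUMPTION (not stated in the abstract): for an adjusted BF, reference_prob
    would instead be a probability reflecting relative sampling abundance.
    """
    for p in (posterior_prob, reference_prob):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    reference_odds = reference_prob / (1.0 - reference_prob)
    return posterior_odds / reference_odds


def error_rates(bf_by_event: dict, true_events: set, threshold: float = 3.0):
    """Return (type_i, type_ii) error rates for a set of candidate events.

    bf_by_event maps each candidate event to its Bayes factor.
    true_events holds the events that really occurred in the simulation.
    An event counts as "supported" when its BF is at least `threshold`
    (the value 3.0 is an illustrative cutoff, not the paper's).
    """
    supported = {e for e, bf in bf_by_event.items() if bf >= threshold}
    present = set(bf_by_event) & true_events
    absent = set(bf_by_event) - true_events
    false_pos = len(supported & absent)   # supported but not real
    false_neg = len(present - supported)  # real but not supported
    type_i = false_pos / len(absent) if absent else math.nan
    type_ii = false_neg / len(present) if present else math.nan
    return type_i, type_ii


if __name__ == "__main__":
    # Same posterior support, two different reference probabilities.
    print(bayes_factor(0.9, 0.5))
    print(bayes_factor(0.9, 0.75))
```

With the same posterior probability, a larger reference probability gives a smaller Bayes factor. Under the stated assumption, this is one way an abundance-aware reference could lower support for transitions involving oversampled locations, in line with the reported trade-off of fewer type I errors and more type II errors for transition events.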

Journal: Molecular Biology and Evolution