Abstract

Genome sequencing projects routinely generate haploid consensus sequences from diploid genomes, which are effectively chimeric sequences with the phase at heterozygous sites resolved at random. The impact of phasing errors on phylogenomic analyses under the multispecies coalescent (MSC) model is largely unknown. Here, we conduct a computer simulation to evaluate the performance of four phase-resolution strategies (the true phase resolution, the diploid analytical integration algorithm which averages over all phase resolutions, computational phase resolution using the program PHASE, and random resolution) on estimation of the species tree and evolutionary parameters in analysis of multilocus genomic data under the MSC model. We found that species tree estimation is robust to phasing errors when species divergences were much older than average coalescent times but may be affected by phasing errors when the species tree is shallow. Estimation of parameters under the MSC model with and without introgression is affected by phasing errors. In particular, random phase resolution causes serious overestimation of population sizes for modern species and biased estimation of cross-species introgression probability. In general, the impact of phasing errors is greater when the mutation rate is higher, the data include more samples per species, and the species tree is shallower with recent divergences. Use of phased sequences inferred by the PHASE program produced small biases in parameter estimates. We analyze two real data sets, one of East Asian brown frogs and another of Rocky Mountains chipmunks, to demonstrate that heterozygote phase-resolution strategies have similar impacts on practical data analyses. We suggest that genome sequencing projects should produce unphased diploid genotype sequences if fully phased data are too challenging to generate, and avoid haploid consensus sequences, which have heterozygous sites phased at random. In case the analytical integration algorithm is computationally unfeasible, computational phasing prior to population genomic analyses is an acceptable alternative. [BPP; introgression; multispecies coalescent; phase; species tree.]

Highlights

  • Next-generation sequencing technologies have revolutionized population genetics and phylogenetics by making it affordable to sequence whole genomes or large portions of the genome, even for nonmodel organisms

  • Bayesian analysis of each replicate data set using each of the four strategies produced a sample from the posterior distribution of the species trees, which we summarized to identify the maximum a posteriori probability (MAP) tree, and construct the 95% credibility set of species trees

  • The different strategies most often produced the same MAP tree, the posterior probability attached to the MAP tree varies somewhat among methods, but the differences are comparable to Markov chain Monte Carlo (MCMC) sampling errors

Read more

Summary

Introduction

Next-generation sequencing technologies have revolutionized population genetics and phylogenetics by making it affordable to sequence whole genomes or large portions of the genome, even for nonmodel organisms. Many phylogenomic studies use the approach of reduced representation library to maximize their DNA sequencing efforts on a small subset of the genome. These strategies can generate thousands of genomic segments (called loci in this article irrespective of whether they are protein-coding) with high coverage, and target sequences can be assembled with confidence. Typical sequencing technologies produce short fragments of sequenced DNA called “reads” that are either de novo assembled or mapped to a pre-existing reference genome. This leads to chromosomal positions being sequenced a variable number of times across the genome (usually referred to as the sequencing depth). Suppose the reads are 14×T and 6×C at the first site, and 7×A, 10×G, and 1×T at the second

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call