The advent of rapid and inexpensive sequencing technologies has necessitated the development of computationally efficient methods for analyzing sequence data for many genes simultaneously in a phylogenetic framework. The coalescent process is the most commonly used model for linking the underlying genealogies of individual genes with the global species-level phylogeny, but inference under the coalescent model is computationally daunting in the typical inference frameworks (e.g., the likelihood and Bayesian frameworks) due to the dimensionality of the space of both gene trees and species trees. Here we consider estimation of the branch lengths in fixed species trees with three or four taxa, and show that these branch lengths are identifiable. We also show that for three and four taxa simple estimators for the branch lengths can be derived based on observed site pattern frequencies. Properties of these estimators, such as their asymptotic variances and large-sample distributions, are examined, and performance of the estimators is assessed using simulation. Finally, we use these estimators to develop a hypothesis test that can be used to delimit species under the coalescent model for three or four putative taxa.
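As a rough illustration of the input these estimators use, the sketch below tabulates site pattern frequencies for three aligned sequences. It is not the estimator itself, whose formulas are not given here. The function name `site_pattern_frequencies`, the five pattern labels, and the choice to skip columns containing gaps or ambiguity codes are all assumptions made for this example.

```python
from collections import Counter

# Pattern classes for three taxa (labels are hypothetical, chosen for this sketch):
#   xxx: all three bases equal      xxy: taxa 1 and 2 equal, taxon 3 differs
#   xyx: taxa 1 and 3 equal         yxx: taxa 2 and 3 equal
#   xyz: all three bases differ
PATTERNS = ("xxx", "xxy", "xyx", "yxx", "xyz")


def site_pattern_frequencies(seqs):
    """Return the relative frequency of each pattern class for three aligned sequences."""
    if len(seqs) != 3:
        raise ValueError("expected exactly three sequences")
    a, b, c = (s.upper() for s in seqs)
    if not (len(a) == len(b) == len(c)):
        raise ValueError("sequences must be aligned to the same length")

    counts = Counter()
    for x, y, z in zip(a, b, c):
        # Assumption: columns with gaps or ambiguity codes are skipped.
        if any(ch not in "ACGT" for ch in (x, y, z)):
            continue
        if x == y == z:
            counts["xxx"] += 1
        elif x == y:
            counts["xxy"] += 1
        elif x == z:
            counts["xyx"] += 1
        elif y == z:
            counts["yxx"] += 1
        else:
            counts["xyz"] += 1

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable alignment columns")
    return {p: counts[p] / total for p in PATTERNS}


if __name__ == "__main__":
    toy_alignment = ["ACGTA", "ACGTC", "ACTTG"]  # made-up data for illustration
    print(site_pattern_frequencies(toy_alignment))
```

In a multilocus setting the same tabulation would presumably be applied to columns pooled across genes. Any branch-length estimator would then be a function of the resulting frequency vector.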