Abstract

This paper presents a method for linking models for aligning linguistic etymological data with models for phylogenetic inference from population genetics. We begin with a large database of genetically related words—sets of cognates—from languages in a language family. We process the cognate sets to obtain a complete alignment of the data. We use the alignments as input to a model developed for phylogenetic reconstruction in population genetics. This is achieved via a natural novel projection of the linguistic data onto genetic primitives. As a result, we induce phylogenies based on aligned linguistic data. We place the method in the context of those reported in the literature, and illustrate its operation on data from the Uralic language family, which results in family trees that are very close to the “true” (expected) phylogenies.

Highlights

  • The mathematical theory of statistical physics has been shown to unite stochastic models of evolution in seemingly diverse fields, such as population genetics, ecology, and linguistics (Blythe and McKane, 2007; Blythe, 2009; Baxter et al., 2009; Vazquez et al., 2010)

  • We alternate between two steps: A. update the count matrix and compute the code length, and B. re-align all word pairs in the corpus, using dynamic-programming re-alignment

  • During the dynamic-programming step, for each word pair we find the best alignment, i.e., the one with the lowest cost given the alignments for the rest of the words
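
The alternation between steps A and B described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gap symbol, the function names, the add-one smoothing, and the plug-in code length are all assumptions made for illustration, and the paper's actual cost function may differ. Each word pair is re-aligned by dynamic programming against counts collected from all the other word pairs, and the cost of an aligned symbol pair is its code length in bits under the current counts.

```python
import math
from collections import Counter

GAP = "."  # hypothetical gap symbol: one side of the aligned pair is empty


def pair_cost(counts, total, n_events, a, b):
    """Bits needed to code the event (a, b) under add-one smoothed counts (assumed model)."""
    return -math.log2((counts[(a, b)] + 1) / (total + n_events))


def align(src, tgt, counts, n_events):
    """Dynamic programming: lowest-cost alignment of one word pair given the counts."""
    total = sum(counts.values())
    n, m = len(src), len(tgt)
    # best[i][j] = (cost, backpointer) for aligning src[:i] with tgt[:j]
    best = [[(math.inf, None)] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            moves = []
            if i > 0 and j > 0:  # symbol aligned with symbol
                moves.append((i - 1, j - 1, (src[i - 1], tgt[j - 1])))
            if i > 0:  # source symbol aligned with a gap
                moves.append((i - 1, j, (src[i - 1], GAP)))
            if j > 0:  # gap aligned with a target symbol
                moves.append((i, j - 1, (GAP, tgt[j - 1])))
            scored = [
                (best[pi][pj][0] + pair_cost(counts, total, n_events, *pair), (pi, pj, pair))
                for pi, pj, pair in moves
            ]
            best[i][j] = min(scored, key=lambda item: item[0])
    # Follow the backpointers from the end of both words to recover the alignment.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        _, (pi, pj, pair) = best[i][j]
        pairs.append(pair)
        i, j = pi, pj
    return pairs[::-1]


def code_length(alignments, n_events):
    """Simplified plug-in code length (bits) of all alignments under their own counts."""
    counts = Counter(p for al in alignments for p in al)
    total = sum(counts.values())
    return sum(pair_cost(counts, total, n_events, a, b) for al in alignments for a, b in al)


def fit(word_pairs, n_iter=5):
    """Alternate between A (update counts) and B (re-align every word pair)."""
    src_syms = {c for s, _ in word_pairs for c in s}
    tgt_syms = {c for _, t in word_pairs for c in t}
    n_events = (len(src_syms) + 1) * (len(tgt_syms) + 1)  # rough size of the event space
    counts = Counter()
    # Initial alignments: with empty counts every event costs the same.
    alignments = [align(s, t, counts, n_events) for s, t in word_pairs]
    for al in alignments:
        counts.update(al)
    for _ in range(n_iter):
        for k, (s, t) in enumerate(word_pairs):
            counts.subtract(alignments[k])  # leave the current pair out of the counts
            alignments[k] = align(s, t, counts, n_events)  # step B
            counts.update(alignments[k])  # step A
    return alignments, counts, n_events


if __name__ == "__main__":
    toy_pairs = [("abc", "abc"), ("abd", "abd"), ("ab", "abx")]  # made-up strings, not real cognates
    als, _, n_ev = fit(toy_pairs)
    for al in als:
        print(al)
    print(code_length(als, n_ev))
```

Removing a pair's own counts before re-aligning it is one way to read "given the alignments for the rest of the words"; the total code length can be tracked after each pass to check that the alternation is converging.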


Summary

Introduction

The mathematical theory of statistical physics has been shown to unite stochastic models of evolution in seemingly diverse fields, such as population genetics, ecology, and linguistics (Blythe and McKane, 2007; Blythe, 2009; Baxter et al., 2009; Vazquez et al., 2010). Statistical inference about language evolution under such models is complicated by the practically intractable form of the likelihoods for even a moderate set of languages. This calls for novel approaches to the probabilistic evaluation of any particular phylogenetic model and to learning the most plausible genealogies from data. In contrast to coalescent-based likelihoods, this approach enables analysis of much larger data collections, as the sufficient statistics from the data correspond under these models to the empirical allele frequencies of each population, rather than genetic characteristics of single individuals. This property makes these models attractive from the perspective of evolutionary linguistics.

