Recent work in computational linguistic phylogeny*

Joseph F. Eska and Don Ringe

1. Introduction

A number of recent attempts by nonlinguists to reconstruct linguistic evolutionary trees have made news. Reconstructions of the phylogeny of the Indo-European (IE) family of languages are especially well represented. Well-known examples include Rexová et al. 2003; Gray & Atkinson 2003, which is discussed by Searls (2003) and briefly reported in U.S. news & world report (10 December 2003); and Forster & Toth 2003, which also generated considerable attention in the popular media.

Scientific linguists have not been impressed for a variety of reasons. Though no two of the publications in question exhibit exactly the same weaknesses, all can be impugned on one or more of the following grounds: the linguistic data employed have not been adequately analyzed or, in some cases, even competently analyzed; the model of language change employed has not been shown to fit the known facts of language change; attempts to fix the dates of prehistoric languages have ignored the fatal shortcomings of glottochronology discovered by Bergsland and Vogt (1962; see further §4); the researchers assume that vocabulary replacement is governed by a lexical clock (similar to the controversial molecular clock posited by some biological cladists);1 and/or the data set used is too small to yield statistically reliable conclusions.

A thoroughgoing critique of all recently published work in this vein would be unwieldy and would require far more space than a discussion note permits. Instead, we focus on the article that best exemplifies the shortcomings listed above, namely the work of Forster and Toth.

1.1 In an article published recently in the Proceedings of the National Academy of Sciences, Peter Forster and Alfred Toth propose a new computational method for recovering the dates of prehistoric events of linguistic speciation (2003, cited hereafter as F & T).
The article appeared with an online appendix,2 as well as references to a web tutorial describing how F & T handle their linguistic data.3 If F & T’s proposal were scientifically adequate, it could represent an important methodological advance in historical linguistics. But we demonstrate in this discussion note that they control neither the data nor the necessary linguistic methodology, that their treatment of linguistic data amounts to an explicit rejection of scientific historical linguistics to the point that it could be called antiscientific, and that their application of computational methods for recovering linguistic evolutionary trees is inadequate to prove the dating of prehistoric linguistic events that they claim.

We wish to emphasize two points at the start. First, we critique F & T’s online appendix and web tutorial together with their published article, since all those materials together constitute a coherent published project. Second, the reader should bear in mind that the main thrust of F & T’s work is the dating of prehistoric linguistic events. Every other part of their project, including their selection and processing of the data (discussed mostly on the associated web sites), serves as input to the process of inferring dates; thus a serious error in any single part invalidates the inference of dates. We show that there are serious errors in every part of their work.

2. Errors in the selection and treatment of the data

Any attempt to assess phylogenetic relationships among human languages rests crucially upon the accurate selection and coding of the data. Just as small programming errors can lead to severe problems in the operation of computer software, errors and/or ambiguities in the selection and coding of linguistic data can lead to significant errors in the construction of phylogenetic trees.
Work of this kind, then, requires either an intimate knowledge of the languages concerned or access to sources that one can trust to be reliable. F & T elect to base their project on the Celtic languages, but it is very clear to us that they not only are not familiar with the linguistic data that they use, but also are not familiar enough with the current state of Celtic historical linguistics to make critical use of the secondary literature. Sometimes they simply ignore what their sources have to say.

2.1 We first note that F...