On the accuracy of language trees.

Simone Pompei, Francesca Tria, Vittorio Loreto

doi:10.1371/journal.pone.0020109


Open Access

https://doi.org/10.1371/journal.pone.0020109


Abstract

Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information typically consists of lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics. From this perspective the reconstruction of language trees is an example of an inverse problem: starting from present, incomplete and often noisy information, one aims at inferring the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models in order to have full access to the evolutionary process one is going to infer. This procedure has an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here by conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. In order to assess the accuracy of the inferred trees we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores we quantify the relative performances of the distance-based algorithms considered. Further, we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally, we draw some conclusions about where the accuracy of the reconstructions in historical linguistics stands and about the leading directions to improve it.

Highlights

The last few years have seen a wave of computational approaches devoted to historical linguistics [1,2,3], mainly centred around phylogenetic methods.
The quantification of the accuracy of the inference of language trees we present is achieved with the Robinson-Foulds distance (RF) [25] and the Quartet Distance (QD) [26].
We present the results of the comparisons between the Ethnologue classifications and the language trees inferred based on the Automated Similarity Judgement Program (ASJP) database.
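
To make the two tree distances named above concrete, the following is a minimal sketch of the standard (not the generalized) Robinson-Foulds and quartet distances. It assumes trees are written as nested Python tuples with language names as leaves; the helper names, the toy trees, and the convention that a quartet unresolved in both trees counts as agreement are illustrative assumptions, not the authors' implementation or data.

```python
from itertools import combinations


def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions (splits) induced by internal edges."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


def quartet_topology(split_set, quartet):
    """Return the 2|2 grouping a tree induces on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for split in split_set:
        side = next(iter(split))
        inside = q & side
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def quartet_distance(tree_a, tree_b):
    """Number of four-leaf subsets on which the two trees disagree (naive enumeration)."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    splits_a, splits_b = splits(tree_a), splits(tree_b)
    return sum(
        quartet_topology(splits_a, q) != quartet_topology(splits_b, q)
        for q in combinations(sorted(leaf_set(tree_a)), 4)
    )


# Toy trees for illustration only (not taken from Ethnologue or ASJP).
reference = ((("Italian", "Spanish"), "French"), ("German", "English"))
inferred = ((("Italian", "French"), "Spanish"), ("German", "English"))
```

On these toy trees, `robinson_foulds(reference, inferred)` is 2 (one split unique to each tree) and `quartet_distance(reference, inferred)` is 2 out of 5 possible quartets. The quartet enumeration grows with the fourth power of the number of languages, so this form is only suitable for small trees.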

Summary

Introduction

The last few years have seen a wave of computational approaches devoted to historical linguistics [1,2,3], mainly centred around phylogenetic methods. Statistical tools [4,5,6,7,8,9], for instance, make it possible to assign time weights to the edges of a phylogenetic tree, giving the opportunity to gather information about the past history of the whole evolutionary process. Swadesh's initial set of meanings included 200 items, which were later reduced to 100, including some new terms that were not in his original list. These famous 100-item Swadesh lists still represent the cornerstone of any attempt to reconstruct phylogenies in historical linguistics.

Methods

Results

Conclusion