Abstract

A lexicostatistical classification is proposed for 20 languages and dialects of the Lezgian group of the North Caucasian family, based on meticulously compiled 110-item wordlists published as part of the Global Lexicostatistical Database project. The lexical data were analyzed with the principal phylogenetic methods, both distance-based and character-based: Starling neighbor joining (StarlingNJ), neighbor joining (NJ), unweighted pair group method with arithmetic mean (UPGMA), Bayesian Markov chain Monte Carlo (MCMC), and unweighted maximum parsimony (UMP). Cognation indexes in the input matrix were assigned in two different ways: by the traditional etymological approach and by phonetic similarity, i.e., the automatic method of consonant classes (Levenshtein distances). For several reasons (above all, the high lexicographic quality of the wordlists and the consensus among Caucasologists about Lezgian phylogeny), the Lezgian database is a perfect testing ground for evaluating phylogenetic methods. For the etymology-based input matrix, all the phylogenetic methods, with the possible exception of UMP, yielded trees that are sufficiently compatible with each other to generate a consensus phylogenetic tree of the Lezgian lects. The resulting consensus tree agrees with the traditional expert classification as well as with some of the previously proposed formal classifications of this linguistic group. Contrary to theoretical expectations, the UMP method produced the least plausible tree of all. For the phonetic similarity-based input matrix, the distance-based methods (StarlingNJ, NJ, UPGMA) produced trees that are rather close to the consensus etymology-based tree and the traditional expert classification, whereas the character-based methods (Bayesian MCMC, UMP) yielded less likely topologies.
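The idea behind consonant-class matching mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which two forms count as similar when their first two consonants fall into the same classes; the class inventory `TOY_CLASSES`, the function names, and the sample words are hypothetical and are not taken from the study. The sketch does not compute edit distances.

```python
# Toy consonant classes (illustrative only; NOT the inventory used in the study).
TOY_CLASSES = {
    "P": "pbfv",
    "T": "td",
    "S": "sz",
    "K": "kgx",
    "M": "m",
    "N": "n",
    "R": "rl",
}

# Reverse lookup: sound -> class label.
SOUND_TO_CLASS = {s: c for c, sounds in TOY_CLASSES.items() for s in sounds}


def skeleton(word: str, length: int = 2) -> str:
    """Return the class labels of the first `length` consonants; vowels are skipped."""
    classes = [SOUND_TO_CLASS[ch] for ch in word.lower() if ch in SOUND_TO_CLASS]
    return "".join(classes[:length])


def similar(a: str, b: str) -> bool:
    """Treat two forms as similar if their non-empty consonant skeletons coincide."""
    sk_a, sk_b = skeleton(a), skeleton(b)
    return bool(sk_a) and sk_a == sk_b


# Hypothetical forms, not real Lezgian data.
match = similar("pata", "bada")      # skeletons PT and PT
mismatch = similar("kara", "sala")   # skeletons KR and SR
```

A real implementation would work on phonemic transcriptions rather than single letters and would need an explicit rule for vowel-initial forms, which this sketch simply treats as having shorter skeletons.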

Highlights

  • For data processed by the StarlingNJ method, two kinds of trees are offered: a tree with binary nodes only, and the same tree in which neighboring nodes are merged into a single node if the temporal distance between them is 300 years or less (300 years correspond to the mutation of ca. 1.5 words in a lect, which is within a reasonable calculation error)

  • All distance-based methods, i.e., StarlingNJ, Neighbor joining (NJ), BioNJ, and Unweighted pair group method with arithmetic mean (UPGMA) (Figs. 2, 4, 5), suggest consecutive bifurcations, with the Udi branch splitting off first and the remainder dividing into Archi and Nuclear Lezgian

  • The database consists of a relatively large number of taxa: 20 lects. It includes both languages that were isolated for a long time, e.g., Archi, and languages that have been in active contact with other languages of the same group, e.g., Aghul
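The node-merging rule from the first highlight can be sketched as follows. This is a minimal illustration, assuming each node carries an age in years before present; the `Node` class, the `collapse` function, and all names and ages in the example are hypothetical. The source does not say how chains of closely spaced nodes are treated, so the bottom-up merging used here is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    age: float  # years before present; 0 for attested lects (assumed convention)
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node, threshold: float = 300.0) -> Node:
    """Merge an internal child into its parent when they are <= threshold years apart."""
    merged: List[Node] = []
    for child in node.children:
        child = collapse(child, threshold)  # process subtrees first (bottom-up)
        if child.children and node.age - child.age <= threshold:
            # Internal node too close to its parent: lift its children up.
            merged.extend(child.children)
        else:
            merged.append(child)
    return Node(node.name, node.age, merged)


# Hypothetical tree with invented ages, not taken from the paper.
tree = Node("root", 2000, [
    Node("X", 0),
    Node("inner1", 1800, [
        Node("A", 0),
        Node("inner2", 1000, [Node("B", 0), Node("C", 0)]),
    ]),
])
flat = collapse(tree)
child_names = [c.name for c in flat.children]
```

Here `inner1` lies 200 years below the root and is merged into it, while `inner2` lies 800 years below `inner1` and is kept, so the root ends up with three children instead of two.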


Summary

Methods

Lexicostatistical trees were produced by several phylogenetic methods. For data processed by the StarlingNJ method, two kinds of trees are offered: a tree with binary nodes only (as produced by the NJ algorithm), and the same tree in which neighboring nodes are merged into a single node if the temporal distance between them is 300 years or less. The trees were produced in the SplitsTree software (v.4.13.1) [31] from the binary lexicostatistical matrix (NEXUS format), which was generated from the original multistate matrix by coding the presence (“1”) or absence (“0”) of each proto-root in each of the 21 languages (Swadesh items superseded by loanwords or not documented are marked as “?”). For the etymology-based wordlist, 4 optimal trees of equal cost were obtained and the strict consensus tree was produced, for which the non-parametric bootstrap test was performed (1000 pseudoreplicates). The trees were visualized in the FigTree software (v.1.4.0).
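The recoding of the multistate matrix into presence/absence characters described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the function name `binarize`, the input layout (one cognate-class id per wordlist item, `None` for a loanword or an undocumented item), and the toy lect names are hypothetical. It also assumes that a missing item yields “?” for every proto-root character of that item.

```python
from typing import Dict, List, Optional


def binarize(multistate: Dict[str, List[Optional[int]]]) -> Dict[str, str]:
    """Recode cognate-class ids as one binary character per proto-root."""
    lects = list(multistate)
    n_items = len(next(iter(multistate.values())))
    rows: Dict[str, List[str]] = {lect: [] for lect in lects}
    for item in range(n_items):
        # Proto-roots attested for this wordlist item across all lects.
        roots = sorted({multistate[l][item] for l in lects
                        if multistate[l][item] is not None})
        for root in roots:
            for lect in lects:
                state = multistate[lect][item]
                if state is None:
                    rows[lect].append("?")  # loanword or undocumented (assumption)
                else:
                    rows[lect].append("1" if state == root else "0")
    return {lect: "".join(chars) for lect, chars in rows.items()}


# Toy matrix with three hypothetical lects and three wordlist items.
toy = {
    "LectA": [1, 1, None],
    "LectB": [1, 2, 1],
    "LectC": [2, 2, 1],
}
binary = binarize(toy)
```

Each resulting string is one row of a binary character matrix; writing such rows into a NEXUS file for tree-building software is left out of the sketch.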

Results
The database consists of a relatively large number of taxa.
