Sampling and genealogical coverage in WALS

Harald Hammarström

doi:10.1515/lity.2009.006

Abstract

The World Atlas of Language Structures (WALS) was designed with the goal of providing a “systematic answer” to questions about the geographical distribution of language features. Achieving this goal requires that WALS include an adequate sample of the world's languages. In this article we investigate to what extent WALS fulfils its aim of maximizing the genealogical diversity of its language samples. To do so, we look at the core-200 sample (included on almost all maps) as well as the 1,370-language sample for the feature OV/VO word order (the sample with the largest number of languages). The genealogical diversity in these samples is compared against a database of “what could have been done”, i.e., a database of which language families have adequate descriptive resources for the task at hand. In the 200-language sample, we find a highly significant overinclusion of Eurasian languages at the expense of South American and Papuan languages. In the 1,370-language sample, we find a highly significant overinclusion of North American languages at the expense of South American and Papuan languages. It follows that statistics based on these WALS samples cannot be used straightforwardly for sound inferences about the distribution of the features in question.
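
The kind of comparison described above can be illustrated with a minimal sketch. The abstract does not say which statistical test the article uses, so the one-sided binomial test below is an assumption chosen for simplicity, and it treats each sampled language as an independent draw. All counts are invented placeholders, not figures from WALS or from the article, and the names `binom_upper_tail` and `overinclusion` are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def overinclusion(sample_counts, available_counts):
    """Compare observed per-area counts with those expected if the sample
    mirrored the share of adequately described families in each area."""
    n = sum(sample_counts.values())
    total = sum(available_counts.values())
    report = {}
    for area, observed in sample_counts.items():
        share = available_counts[area] / total
        report[area] = {
            "observed": observed,
            "expected": n * share,
            # A small value suggests the area is overincluded.
            "p_over": binom_upper_tail(observed, n, share),
        }
    return report


# Placeholder counts for illustration only (NOT real data).
sample = {"Eurasia": 60, "South America": 20, "Papua": 20, "Other": 100}
available = {"Eurasia": 40, "South America": 100, "Papua": 100, "Other": 160}

report = overinclusion(sample, available)
for area, row in report.items():
    print(f"{area:14s} observed={row['observed']:4d} "
          f"expected={row['expected']:6.1f} p_over={row['p_over']:.3g}")
```

With these placeholder counts, an area whose observed count is far above its expected count gets a very small `p_over`, while an underrepresented area gets a value close to 1. A real analysis would also need to decide whether to count languages or families and how to delimit the areas, neither of which the abstract specifies.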
