Abstract

The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of these data are provided in a large variety of formats and are therefore not amenable to comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists and structural datasets) and a framework to incorporate more data types (e.g. parallel texts and dictionaries). The new specification for cross-linguistic data formats comes with a software package for validation and manipulation, a basic ontology which links to more general frameworks, and usage examples of best practices.

Highlights

  • The last two decades have witnessed a dramatic increase in language data, in the form of monolingual resources[1] for the world’s biggest languages, and in the form of cross-linguistic datasets which try to cover as many of the world’s languages as possible

  • To address the obstacles to sharing and re-using cross-linguistic datasets, the Cross-Linguistic Data Formats initiative (CLDF) offers modular specifications for common data types in language typology and historical linguistics, which are based on a shared data model and a formal ontology

  • The core concepts of this model are derived from the data model originally developed for the Cross-Linguistic Linked Data project, which aimed to develop and curate interoperable data publication structures using linked data principles as the integration mechanism for distributed resources


Summary

Introduction

The last two decades have witnessed a dramatic increase in language data, in the form of monolingual resources[1] for the world’s biggest languages, and in the form of cross-linguistic datasets which try to cover as many of the world’s languages as possible. Cross-linguistic data have proven useful for detecting semantic structures which are universal across human populations[6] and for tracing how semantic systems like color terminology have evolved[7,8]. Another group of studies has analysed cross-linguistic data using quantitative phylogenetic methods to investigate when particular language families started to diverge[9,10,11,12]. Cross-linguistic studies have even explored proposed non-linguistic factors shaping languages, from climate[13,14] to population size[15,16,17] to genes[18,19], and how these factors may or may not shape human social behavior at the level of societies[20]. (All URLs mentioned in this paragraph were accessed July 26, 2018.)

