Abstract

The Lesser Sunda Islands in eastern Indonesia cover a longitudinal distance of some 600 kilometres. They are the westernmost place where languages of the Austronesian family come into contact with a family of Papuan languages, and they constitute an area of high linguistic diversity. Despite this diversity, the Lesser Sundas are little studied, and for most of the region written historical records, as well as archaeological and ethnographic data, are lacking. In such circumstances the study of relationships between languages through their lexicon is a unique tool for making inferences about human (pre-)history and tracing population movements. However, the lack of a collective body of lexical data has severely limited our understanding of the history of the languages and peoples in the Lesser Sundas. The LexiRumah database fills this gap by assembling lexicons of Lesser Sunda languages from published and unpublished sources, and making those lexicons available online in a consistent format. This database makes it possible for researchers to explore the linguistic data collated from different primary sources, to formulate hypotheses on how the languages of the two families might be internally related, and to compare competing hypotheses about subgroupings and language contact in the region. In this article, we present observations from aggregating lexical data from sources of different types and quality, including fieldwork, and generalize the lessons learned into practical guidelines for creating a consistent database of comparable lexical items, derived from the design and development of LexiRumah. Databases like this are instrumental in developing theories of language evolution and change in understudied regions where small-scale, pre-industrial, pre-literate societies are the majority. It is therefore vital to follow reliable design choices when creating such databases, as described in this paper.

Highlights

  • The Lesser Sunda Islands in eastern Indonesia are an area of high linguistic diversity where several hundreds of often vastly different language varieties are spoken

  • LexiRumah is designed as a tool to investigate the linguistic history of the Lesser Sunda Islands and contains the lexicon of 101 varieties from two language families spoken in the region: Austronesian (Malayo-Polynesian) and Timor-Alor-Pantar

  • Creating a database such as LexiRumah has become significantly easier in the last decade and is hardly comparable to creating the first computer-readable lexical databases, such as the Indo-European database on punched cards generated by Dyen in the 1960s


Summary

Introduction

The Lesser Sunda Islands in eastern Indonesia are an area of high linguistic diversity where several hundreds of often vastly different language varieties are spoken. The aim of the LexiRumah database is to provide easy online access to large amounts of lexical data for the wider scientific community, including linguists, historians, and ethnographers. This database makes it possible for researchers to explore the linguistic data collected from primary sources, to formulate hypotheses on how the languages of the two families might be internally related, and to compare competing hypotheses about subgroupings in the region [11,12,13]. LexiRumah is designed as a tool to investigate the linguistic history of the Lesser Sunda Islands and contains the lexicon of 101 varieties from two language families spoken in the region: Austronesian (Malayo-Polynesian) and Timor-Alor-Pantar. Chirila and TransNewGuinea.org were designed as ways to curate, compile and disseminate lexical data from old and recent sources, including dictionaries, and as such they contain long word lists of particular languages.
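To make the idea of a consistent, queryable lexical database more concrete, the following is a minimal sketch of how word forms could be linked to lects and concepts and then compared across varieties. It is not LexiRumah's actual schema: the table names, column names, and helper functions are assumptions made for illustration, and all rows are invented placeholders rather than real linguistic data.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical lect, concept and form tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lects (id TEXT PRIMARY KEY, family TEXT);
        CREATE TABLE concepts (id TEXT PRIMARY KEY, gloss TEXT);
        CREATE TABLE forms (
            id INTEGER PRIMARY KEY,
            lect_id TEXT REFERENCES lects(id),
            concept_id TEXT REFERENCES concepts(id),
            form TEXT,
            source TEXT
        );
        """
    )
    # Placeholder rows only: these are NOT attested lects, forms or sources.
    conn.executemany(
        "INSERT INTO lects VALUES (?, ?)",
        [("lect_A", "family_X"), ("lect_B", "family_X"), ("lect_C", "family_Y")],
    )
    conn.executemany(
        "INSERT INTO concepts VALUES (?, ?)",
        [("c1", "hand"), ("c2", "water")],
    )
    conn.executemany(
        "INSERT INTO forms (lect_id, concept_id, form, source) VALUES (?, ?, ?, ?)",
        [
            ("lect_A", "c1", "aaa", "src1"),
            ("lect_B", "c1", "aab", "src1"),
            ("lect_C", "c1", "zzz", "src2"),
            ("lect_A", "c2", "bbb", "src1"),
        ],
    )
    return conn


def forms_for_concept(conn, concept_id):
    """Return (lect, family, form) rows for one concept, ordered by lect and form."""
    return conn.execute(
        """
        SELECT f.lect_id, l.family, f.form
        FROM forms AS f
        JOIN lects AS l ON l.id = f.lect_id
        WHERE f.concept_id = ?
        ORDER BY f.lect_id, f.form
        """,
        (concept_id,),
    ).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for lect, family, form in forms_for_concept(demo, "c1"):
        print(lect, family, form)
```

Keeping the source alongside every form, as in this sketch, is one simple way to let data from different primary sources sit side by side without losing track of where each item came from.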

Overview of the paper
Other aims
Language collection policies
Database structure and user interface
Word form table
Concept table
Lect table
Source table
Cognate table
Data content
Differences and overlaps with other online lexical databases
Workflow and challenges
Compiling word lists through fieldwork
Transcription
Extraction from published sources
IPA cleanup
Checks by native speakers
Editing CLDF using other software
Similarity coding
Conversion into SQLite for the online interface
External access
Versioning and backups
Example use case
Discussion
Do not remove data
Keep thinking linguistically
Ensure credit where credit is due
Outlook