Abstract

Starting from a large collection of digitized raw-text descriptions of the languages of the world, we address the problem of extracting information of interest to linguists from them. We describe a general technique to extract properties of the described languages associated with a specific term. The technique is simple to implement, simple to explain, requires no training data or annotation, and requires no manual tuning of thresholds. The results are evaluated on a large gold standard database on classifiers, with accuracy results that match or exceed human inter-coder agreement on similar tasks. Although accuracy is competitive, the method may still be enhanced by a more rigorous probabilistic background theory and by the use of extant NLP tools for morphological variants, collocations and vector-space semantics.

Highlights

  • The present paper addresses extraction of information about languages of the world from digitized full-text grammatical descriptions

  • The typical instances of such information extraction tasks are so-called typological features, e.g., whether the language has tone, prepositions, SOV basic constituent order and so on, similar in spirit to those found in the database WALS (wals.info; Dryer and Haspelmath, 2013)

  • We focus on the prospects of term spotting, but in a way that obviates the need for either manual tuning of thresholds or supervised training data


Summary

Introduction

The present paper addresses extraction of information about languages of the world from digitized full-text grammatical descriptions. We focus on the prospects of term spotting, but in a way that obviates the need for either manual tuning of thresholds or supervised training data. This approach is limited to features whose presence is frequently signalled by a specific term or a small set of terms, e.g., classifier, suffix(es), preposition(s), rounded vowel(s) or inverse. The underlying premise is that if a term k describing a property of objects in S occurs in a document d to a significant degree, then the object s described in d has the property signalled by k. These premises also apply to domains and texts other than the linguistic descriptions in the present study, e.g., ethnographic descriptions. Some post-correction of OCR output that is highly relevant for the genre of linguistics is possible and advisable.
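
The premise above can be illustrated with a small sketch. The Python code below is not the paper's procedure, which is not given in this summary. It is a hypothetical stand-in that assumes a term's rate per token measures how strongly it occurs, and that places the cut-off at the widest gap between the sorted rates so that no threshold is set by hand. The names term_rate and spot_term and the toy documents are invented for illustration.

```python
import re


def term_rate(text, variants):
    """Return the fraction of tokens in `text` that match a term variant."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in variants)
    return hits / len(tokens)


def spot_term(docs, variants):
    """Map each document id to True if the term occurs to a marked degree.

    Assumption: the cut-off sits in the middle of the widest gap between
    the sorted rates. This is an illustrative heuristic, not the paper's
    method.
    """
    rates = {doc_id: term_rate(text, variants) for doc_id, text in docs.items()}
    ordered = sorted(set(rates.values()))
    if len(ordered) < 2:
        # Every document has the same rate, so nothing separates them.
        return {doc_id: False for doc_id in rates}
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, split = max(gaps)
    cutoff = (ordered[split] + ordered[split + 1]) / 2
    return {doc_id: rate > cutoff for doc_id, rate in rates.items()}


# Toy stand-ins for grammatical descriptions (hypothetical text).
docs = {
    "lang_a": "The numeral classifier follows the noun. Each classifier "
              "selects a noun class. Classifiers are obligatory.",
    "lang_b": "The verb takes a suffix. Word order is SOV. One author "
              "calls this particle a classifier, wrongly.",
    "lang_c": "Nouns are not marked for number. Word order is SVO.",
}
variants = {"classifier", "classifiers"}

result = spot_term(docs, variants)
print(result)
```

In this toy collection only lang_a is flagged. The single passing mention in lang_b falls below the automatically placed cut-off, which is the kind of incidental use that a plain "term occurs at least once" rule would misclassify.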

