Abstract

In this paper, we demonstrate a system that shows great promise for creating part-of-speech (POS) taggers for languages with few or no curated resources, and that requires no expert involvement. Interlinear Glossed Text (IGT) is a resource available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010). Using nothing more than IGT from this database and a classification-based projection approach tailored for IGT, we show that it is feasible to train reasonably performing annotators of interlinear text from projected annotations for potentially hundreds of the world’s languages. Doing so can facilitate the automatic enrichment of interlinear resources to aid the field of linguistics.

Highlights

  • In this paper we discuss the process by which a highly multilingual linguistic resource can be built and subsequently automatically enriched

  • We show that the linguistic knowledge encapsulated in all of the data, irrespective of the language, can improve the accuracy of NLP tools that are developed for any specific language

  • We focus on using a resource known as Interlinear Glossed Text (IGT) as a possible source of linguistic knowledge for the POS tagging task on resource-poor languages, and apply it to the enrichment of a linguistic resource composed of IGT data

Summary

Introduction

In this paper we discuss the process by which a highly multilingual linguistic resource (greater than 1,200 languages) can be built and subsequently automatically enriched. We show that the linguistic knowledge encapsulated in all of the data, irrespective of the language, can improve the accuracy of NLP tools that are developed for any specific language. This is true for languages that are otherwise highly under-resourced, and for which automated NLP tools, such as taggers, are either impossible or very expensive to develop using traditional methods. POS tagging is generally thought of as a solved task for many languages, with per-token accuracies reaching 97% (Brants, 2000; Toutanova et al., 2003). While these high accuracies can certainly be achieved for languages with substantial annotated resources, many low-resource languages have little to no annotated data available, making such traditional supervised approaches impossible. If annotated resources are not available, what methods can be used?

