Multilingual Projection for Parsing Truly Low-Resource Languages

Željko Agić, Barbara Plank, Anders Johannsen, Anders Søgaard, Héctor Martínez Alonso, Natalie Schluter

doi:10.1162/tacl_a_00100

Abstract

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.
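To make the idea of annotation projection concrete, the sketch below transfers POS tags from several tagged source sentences onto one target sentence through word alignments, choosing each target tag by confidence-weighted voting. It is a minimal illustration and not the authors' implementation: the function name `project_pos_tags`, the alignment triples, and the voting rule are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def project_pos_tags(target_tokens, sources):
    """Project POS tags onto target tokens by voting over aligned source words.

    sources is a list of (source_tags, alignments) pairs, one per source
    language. alignments is a list of (source_index, target_index, confidence)
    triples. This input format is hypothetical, chosen for illustration.
    """
    votes = defaultdict(Counter)
    for source_tags, alignments in sources:
        for src_i, tgt_i, confidence in alignments:
            # Each aligned source word votes for its own tag, weighted by
            # how confident the aligner is in the link.
            votes[tgt_i][source_tags[src_i]] += confidence

    projected = []
    for i in range(len(target_tokens)):
        counter = votes.get(i)
        if counter:
            projected.append(counter.most_common(1)[0][0])
        else:
            # No source word aligned to this token, so it stays untagged.
            projected.append(None)
    return projected


if __name__ == "__main__":
    # Two hypothetical source sentences aligned to a three-token target.
    sources = [
        (["DET", "NOUN", "VERB"], [(1, 0, 0.9), (2, 1, 0.8)]),
        (["NOUN", "NOUN"], [(0, 0, 0.5), (1, 1, 0.3)]),
    ]
    print(project_pos_tags(["a", "b", "c"], sources))
```

Voting across several source languages lets a confident alignment from one source outweigh a noisy one from another. Target tokens with no alignment stay untagged here, so a downstream learner would have to handle partially labeled sentences.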

Highlights

State-of-the-art approaches to inducing part-of-speech (POS) taggers and dependency parsers only scale to a small fraction of the world’s ∼6,900 languages.
The major bottleneck is the lack of manually annotated resources for the vast majority of these languages, including languages spoken by millions, such as Marathi (73m), Hausa (50m), and Kurdish (30m).
The first one does not entirely adhere to Universal Dependencies (UD), but we provide a POS tagset mapping and a few modifications and include it as a test language to deepen the robustness assessment for our approach across language families.

Summary

Introduction

State-of-the-art approaches to inducing part-of-speech (POS) taggers and dependency parsers only scale to a small fraction of the world’s ∼6,900 languages. Cross-lingual transfer learning, or cross-lingual learning, refers to work on using annotated resources in other (source) languages to induce models for such low-resource (target) languages. Most work in cross-lingual learning makes assumptions about the availability of linguistic resources that do not hold for the majority of low-resource languages. The best cross-lingual dependency parsing results reported to date were presented by Rasooli and Collins (2015). They use the intersection of languages covered in the Google dependency treebanks project and those contained in the Europarl corpus. They only consider closely related Indo-European languages for which high-quality tokenization can be obtained with simple heuristics.

Journal: Transactions of the Association for Computational Linguistics
Publication Date: Dec 1, 2016
Citations: 104
License type: cc-by
