Abstract

We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end; in particular a partially observed conditional random field model, where coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.

Highlights

  • Supervised part-of-speech (POS) taggers are available for more than twenty languages and achieve accuracies of around 95% on in-domain data (Petrov et al., 2012)

  • Although the aforementioned weakly supervised approaches have resulted in significant improvements over fully unsupervised approaches, they have not exploited the benefits of token-level cross-lingual projection methods, which are possible with word-aligned bitext between a target language of interest and a resource-rich source language, such as English

  • We observe similar trends in these results, but on average, accuracies are much lower than those of the type-constrained hidden Markov models (HMMs); the conditional random field (CRF) model with the union dictionary along with cluster features achieves an average accuracy of 79.3% when trained on the same data

Summary

Introduction

Supervised part-of-speech (POS) taggers are available for more than twenty languages and achieve accuracies of around 95% on in-domain data (Petrov et al., 2012). Tag dictionaries, noisily projected via word-aligned bitext, have bridged the gap between purely unsupervised and fully supervised taggers, resulting in an average accuracy of over 83% on a benchmark of eight Indo-European languages (Das and Petrov, 2011). Although the aforementioned weakly supervised approaches have resulted in significant improvements over fully unsupervised approaches, they have not exploited the benefits of token-level cross-lingual projection methods, which are possible with word-aligned bitext between a target language of interest and a resource-rich source language, such as English. This is the setting we consider in this paper (§2).
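
To make the distinction between token-level and type-level constraints concrete, the following is a minimal Python sketch. It is not taken from the paper: the tag inventory, the function names `project_tags` and `couple`, the one-to-one alignment format, and the rule used to combine the two kinds of constraints are all assumptions made for illustration. The sketch copies tags from a tagged source sentence onto aligned target tokens (a token constraint), looks each target word up in a tag dictionary (a type constraint), and keeps the projected tag only when the dictionary does not contradict it.

```python
from typing import Dict, List, Optional, Set

# Hypothetical coarse tag inventory, for illustration only.
ALL_TAGS: Set[str] = {"NOUN", "VERB", "ADJ", "DET", "PRON"}


def project_tags(source_tags: List[str],
                 alignment: Dict[int, int],
                 target_len: int) -> List[Optional[str]]:
    """Token constraints: copy each source tag onto its aligned target token.

    `alignment` maps source index -> target index (assumed one-to-one here).
    Unaligned target tokens get None.
    """
    projected: List[Optional[str]] = [None] * target_len
    for src_idx, tgt_idx in alignment.items():
        projected[tgt_idx] = source_tags[src_idx]
    return projected


def couple(target_words: List[str],
           projected: List[Optional[str]],
           tag_dict: Dict[str, Set[str]]) -> List[Set[str]]:
    """Combine token and type constraints into a set of allowed tags per token.

    Assumed combination rule (an illustration, not the paper's definition):
      - word in dictionary, projected tag licensed by it -> only that tag
      - word in dictionary, projected tag missing or not licensed -> dictionary tags
      - word not in dictionary, projected tag present -> only that tag
      - otherwise -> every tag is allowed
    """
    allowed: List[Set[str]] = []
    for word, token_tag in zip(target_words, projected):
        type_tags = tag_dict.get(word)
        if type_tags is None:
            allowed.append({token_tag} if token_tag else set(ALL_TAGS))
        elif token_tag in type_tags:
            allowed.append({token_tag})
        else:
            allowed.append(set(type_tags))
    return allowed


# Toy example: English source "the dog barks", German target "der Hund bellt".
source_tags = ["DET", "NOUN", "VERB"]
target_words = ["der", "Hund", "bellt"]
alignment = {0: 0, 2: 2}  # "dog" is left unaligned on purpose
tag_dict = {"der": {"DET", "PRON"}, "Hund": {"NOUN"}}  # "bellt" is missing

projected = project_tags(source_tags, alignment, len(target_words))
lattice = couple(target_words, projected, tag_dict)
```

In this toy example the ambiguous dictionary entry for "der" is narrowed by the projected tag, the unaligned "Hund" falls back to its dictionary entry, and the out-of-dictionary "bellt" relies on the projected tag alone. A per-token set of allowed tags like `lattice` is the kind of partial signal that a model could be trained against.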

Coupling Token and Type Constraints
Token Constraints
Type Constraints
Coupled Constraints
Models with Coupled Constraints
HMMs with Coupled Constraints
CRFs with Coupled Constraints
Empirical Study
Experimental Setup
Models with Type Constraints
Models with Token and Type Constraints
Further Analysis
Influence of Token and Type Constraints
Wiktionary Pruning Mistakes
Findings
Conclusions