Abstract

We describe a simple but effective method for cross-lingual syntactic transfer of dependency parsers, in the scenario where a large amount of translation data is not available. This method makes use of three steps: 1) a method for deriving cross-lingual word clusters, which can then be used in a multilingual parser; 2) a method for transferring lexical information from a target language to source language treebanks; 3) a method for integrating these steps with the density-driven annotation projection method of Rasooli and Collins (2015). Experiments show improvements over the state-of-the-art in several languages used in previous work, in a setting where the only source of translation data is the Bible, a considerably smaller corpus than the Europarl corpus used in previous work. Results using the Europarl corpus as a source of translation data show additional improvements over the results of Rasooli and Collins (2015). We conclude with results on 38 datasets from the Universal Dependencies corpora.
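
To make the first of these steps concrete, here is a minimal sketch of one way cross-lingual word clusters can be derived; it is an illustration under simplifying assumptions, not the paper's exact procedure. The function name `build_mixed_corpus`, the replacement probability, and the tiny English/Spanish sentences and dictionaries are all hypothetical: the idea is that interleaving the two monolingual corpora and randomly swapping words for their dictionary translations lets a standard monolingual clustering tool (e.g., Brown clustering) place translation pairs in the same cluster.

```python
import random

def build_mixed_corpus(source_sents, target_sents, s2t, t2s, replace_prob=0.5, seed=0):
    """Interleave source- and target-language sentences, randomly swapping each
    word for its dictionary translation (when one exists). Running a monolingual
    word-clustering tool on the result tends to place a word and its translation
    in the same cluster, yielding cross-lingual clusters."""
    rng = random.Random(seed)
    mixed = []
    for sents, table in ((source_sents, s2t), (target_sents, t2s)):
        for sent in sents:
            mixed.append([table[w] if w in table and rng.random() < replace_prob else w
                          for w in sent])
    return mixed

# Tiny, purely illustrative example (hypothetical data and dictionaries).
en = [["the", "dog", "barks"]]
es = [["el", "perro", "ladra"]]
en2es = {"the": "el", "dog": "perro"}
es2en = {"el": "the", "perro": "dog"}
print(build_mixed_corpus(en, es, en2es, es2en))
```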

Highlights

  • Creating manually-annotated syntactic treebanks is an expensive and time-consuming task

  • We describe a method for transferring lexical information from the target language into source-language treebanks, using word-to-word translation dictionaries derived from parallel corpora (a simplified sketch follows this list)

  • We describe an approach that gives significant improvements over the baseline. §3.1 describes a method for deriving cross-lingual clusters, allowing us to add cluster features φ^(c)(x, y) to the model. §3.2 describes a method for adding lexical features φ^(l)(x, y) to the model. §3.3 describes a method for integrating these techniques with the density-driven annotation projection method of Rasooli and Collins (2015)
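
To illustrate the lexicalization step referenced above (§3.2 and the second highlight), the sketch below maps the word forms of a source-language tree into the target language through a word-to-word translation dictionary while leaving POS tags and dependency heads untouched. It is a simplified illustration rather than the paper's full method; the `lexicalize` function, the tiny English tree, and the English-to-Spanish dictionary are hypothetical.

```python
from typing import Dict, List, Tuple

# A token is (form, POS tag, head index); indices are 0-based, head -1 marks the root.
Tree = List[Tuple[str, str, int]]

def lexicalize(tree: Tree, s2t: Dict[str, str], unk: str = "_UNK_") -> Tree:
    """Replace each source-language word form with a target-language form via a
    word-to-word translation dictionary, keeping POS tags and dependency heads
    unchanged. Words without a dictionary entry get a placeholder, so a parser
    trained on the output falls back on POS and cluster features for them."""
    return [(s2t.get(form.lower(), unk), pos, head) for form, pos, head in tree]

# Hypothetical English source tree turned into Spanish-lexicalized training data.
en_tree = [("The", "DET", 1), ("dog", "NOUN", 2), ("barks", "VERB", -1)]
en2es = {"the": "el", "dog": "perro", "barks": "ladra"}
print(lexicalize(en_tree, en2es))
# [('el', 'DET', 1), ('perro', 'NOUN', 2), ('ladra', 'VERB', -1)]
```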

Introduction

Creating manually-annotated syntactic treebanks is an expensive and time-consuming task, which motivates transferring syntactic knowledge to languages that lack such resources. The Bible data we use contains a much smaller set of sentences (around 24,000) than other translation corpora, for example Europarl (Koehn, 2005), which has around 2 million sentences per language pair; this makes it a considerably more challenging corpus to work with. We achieve 80.9% average unlabeled attachment score (UAS) on the languages used in previous work; in comparison, Zhang and Barzilay (2015), Guo et al. (2016), and Ammar et al. (2016b) achieve UAS of 75.4%, 76.3%, and 77.8%, respectively. All of these previous works make use of the much larger Europarl corpus to derive lexical representations. On the Universal Dependencies corpora, thirteen datasets (10 languages) reach accuracies higher than 80.0%.
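
For reference, the unlabeled attachment score (UAS) quoted above is simply the percentage of tokens whose predicted head matches the gold-standard head; a minimal sketch of the computation (with made-up head sequences) follows.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """UAS: percentage of tokens whose predicted head index equals the gold head."""
    assert gold_heads and len(gold_heads) == len(predicted_heads)
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)

# Example: 4 of 5 tokens receive the correct head, so UAS = 80.0.
print(unlabeled_attachment_score([2, 0, 2, 5, 3], [2, 0, 2, 5, 2]))
```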

The Parsing Model
Data Assumptions
A Baseline Approach
Translation Dictionaries
Our Approach
Learning Cross-Lingual Clusters
Treebank Lexicalization
Data and Tools
Results on the Google Treebank
Related Work
Conclusions