Abstract

The growing work in multi-lingual parsing faces the challenge of fair comparative evaluation and performance analysis across languages and their treebanks. The difficulty lies in teasing apart the properties of treebanks, such as their size or average sentence length, from those of the annotation scheme, and from the linguistic properties of languages. We propose a method to evaluate the effects of word order of a language on dependency parsing performance, while controlling for confounding treebank properties. The method uses artificially-generated treebanks that are minimal permutations of actual treebanks with respect to two word order properties: word order variation and dependency lengths. Based on these artificial data on twelve languages, we show that longer dependencies and higher word order variability degrade parsing performance. Our method also extends to minimal pairs of individual sentences, leading to a finer-grained understanding of parsing errors.

Highlights

  • Fair comparative performance evaluation across languages and their treebanks is one of the difficulties for work on multi-lingual parsing (Buchholz and Marsi, 2006; Nivre et al, 2007; Seddah et al, 2011)

  • In a set of pairwise comparisons between original and permuted treebanks, we confirm the influence of word order variability and dependency length on parsing performance, at the large scale provided by fourteen different treebanks across twelve different languages

  • The graph-based architecture is known to be less dependent on word order and dependency length than transition-based dependency parsers, as it searches the whole space of possible parse trees and solves a global optimisation problem (McDonald and Nivre, 2011)

Read more

Summary

Introduction

Fair comparative performance evaluation across languages and their treebanks is one of the difficulties for work on multi-lingual parsing (Buchholz and Marsi, 2006; Nivre et al, 2007; Seddah et al, 2011). We compare how the parsing performances on the original and the permuted trees vary in relation to the quantified measures of the dependency length and word order variation properties of the treebanks. Morphologically-rich languages are known to be hard for parsing, as rich morphology increases the percentage of new words in the test set (Nivre et al, 2007; Tsarfaty et al, 2010) These languages often exhibit very flexible word order. In a set of pairwise comparisons between original and permuted treebanks, we confirm the influence of word order variability and dependency length on parsing performance, at the large scale provided by fourteen different treebanks across twelve different languages.. On an example of one treebank, we show how our method can be extended to provide finer-grained analyses at the sentence level and relate the parsing errors to properties of the parsing architecture

Methodology
Word order properties
Creating trees with optimal DL
Creating trees with optimal Entropy
Dependency Treebanks
Word order properties of original and permuted treebanks
Parsing setup
Comparison of parsing performance between original and permuted treebanks
Sentence-level analysis of parsing performance
General discussion
Related work
Findings
Conclusions
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call