Abstract
The growing work in multi-lingual parsing faces the challenge of fair comparative evaluation and performance analysis across languages and their treebanks. The difficulty lies in teasing apart the properties of treebanks, such as their size or average sentence length, from those of the annotation scheme, and from the linguistic properties of languages. We propose a method to evaluate the effects of word order of a language on dependency parsing performance, while controlling for confounding treebank properties. The method uses artificially-generated treebanks that are minimal permutations of actual treebanks with respect to two word order properties: word order variation and dependency lengths. Based on these artificial data on twelve languages, we show that longer dependencies and higher word order variability degrade parsing performance. Our method also extends to minimal pairs of individual sentences, leading to a finer-grained understanding of parsing errors.
Highlights
Fair comparative performance evaluation across languages and their treebanks is one of the difficulties for work on multi-lingual parsing (Buchholz and Marsi, 2006; Nivre et al., 2007; Seddah et al., 2011).
In a set of pairwise comparisons between original and permuted treebanks, we confirm the influence of word order variability and dependency length on parsing performance, at the large scale provided by fourteen different treebanks across twelve different languages.
Graph-based parsers are known to be less dependent on word order and dependency length than transition-based parsers, as they search the whole space of possible parse trees and solve a global optimisation problem (McDonald and Nivre, 2011).
Summary
Fair comparative performance evaluation across languages and their treebanks is one of the difficulties for work on multi-lingual parsing (Buchholz and Marsi, 2006; Nivre et al., 2007; Seddah et al., 2011). We compare how parsing performance on the original and the permuted trees varies with quantified measures of the dependency length and word order variation of the treebanks. Morphologically-rich languages are known to be hard to parse, as rich morphology increases the percentage of new words in the test set (Nivre et al., 2007; Tsarfaty et al., 2010). These languages often exhibit very flexible word order. In a set of pairwise comparisons between original and permuted treebanks, we confirm the influence of word order variability and dependency length on parsing performance, at the large scale provided by fourteen different treebanks across twelve different languages. Using one treebank as an example, we show how our method can be extended to provide finer-grained analyses at the sentence level and relate the parsing errors to properties of the parsing architecture.
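The two word order properties discussed above can be made concrete with a small sketch. This summary does not give the paper's exact definitions, so the two measures below are illustrative assumptions: mean dependency length is taken as the average linear distance between a word and its head, and word order variability is approximated by the entropy of head direction per relation type. The function names and the `(head, relation)` input format are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumed input format: a treebank is a list of sentences; each sentence is a
# list of (head, relation) pairs, where the i-th pair (1-based) describes
# token i and head == 0 marks the root.


def mean_dependency_length(treebank):
    """Average linear distance between each non-root token and its head."""
    total = 0
    count = 0
    for sentence in treebank:
        for position, (head, _relation) in enumerate(sentence, start=1):
            if head == 0:  # the root has no incoming arc to measure
                continue
            total += abs(position - head)
            count += 1
    return total / count if count else 0.0


def head_direction_entropy(treebank):
    """Arc-weighted average, over relation types, of the entropy (in bits)
    of the head appearing to the left or to the right of its dependent.
    This is an illustrative proxy for word order variability."""
    directions = defaultdict(Counter)
    for sentence in treebank:
        for position, (head, relation) in enumerate(sentence, start=1):
            if head == 0:
                continue
            side = "head_left" if head < position else "head_right"
            directions[relation][side] += 1

    total_arcs = sum(sum(c.values()) for c in directions.values())
    if total_arcs == 0:
        return 0.0

    entropy = 0.0
    for counts in directions.values():
        n = sum(counts.values())
        h = -sum((k / n) * math.log2(k / n) for k in counts.values())
        entropy += (n / total_arcs) * h
    return entropy


if __name__ == "__main__":
    # Toy treebank: the "obj" relation appears once on each side of its head.
    toy = [
        [(2, "obj"), (0, "root")],
        [(0, "root"), (1, "obj")],
    ]
    print(mean_dependency_length(toy))
    print(head_direction_entropy(toy))
```

Under these assumptions, a permuted treebank that fixes the direction of every relation would score zero on the entropy measure, while one that shortens dependencies would lower the mean length, which is the kind of contrast the pairwise comparisons rely on.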
More From: Transactions of the Association for Computational Linguistics