Abstract
Cross-lingual dependency parsing involves transferring syntactic knowledge from one language to another. It is a crucial component for inducing dependency parsers in low-resource scenarios where no training data for a language exists. Using Faroese as the target language, we compare two approaches using annotation projection: first, projecting from multiple monolingual source models; second, projecting from a single polyglot model which is trained on the combination of all source languages. Furthermore, we reproduce multi-source projection (Tyers et al., 2018), in which dependency trees of multiple sources are combined. Finally, we apply multi-treebank modelling to the projected treebanks, in addition to or alternatively to polyglot modelling on the source side. We find that polyglot training on the source languages produces an overall trend of better results on the target language but the single best result for the target language is obtained by projecting from monolingual source parsing models and then training multi-treebank POS tagging and parsing models on the target side.
Highlights
Cross-lingual transfer methods, i. e. methods that transfer knowledge from one or more source languages to a target language, have led to substantial improvements for low-resource dependency parsing (Rosa and Marecek, 2018; Agicet al., 2016; Guo et al, 2015; Lynn et al, 2014; McDonald et al, 2011; Hwa et al, 2005) and part-ofspeech (POS) tagging (Plank and Agic, 2018)
Inspired by recent literature involving multilingual learning (Mulcaire et al, 2019; Smith et al, 2018; Vilares et al, 2016), we investigate whether additional improvements can be made by: 1. using a single polyglot2 parsing model which is trained on the combination of all source languages to create synthetic source treebanks
We aim to investigate whether the current state-of-the-art approach for Faroese, which relies on cross-lingual transfer, can be improved upon by adopting an approach based on source-side polyglot learning and/or target-side multi-treebank learning
Summary
Cross-lingual transfer methods, i. e. methods that transfer knowledge from one or more source languages to a target language, have led to substantial improvements for low-resource dependency parsing (Rosa and Marecek, 2018; Agicet al., 2016; Guo et al, 2015; Lynn et al, 2014; McDonald et al, 2011; Hwa et al, 2005) and part-ofspeech (POS) tagging (Plank and Agic, 2018). We build on recent work by Tyers et al (2018) who show that in the absence of annotated training data for the target language, a lexicalized treebank can be created by translating a target language corpus into a number of related source languages and parsing the translations using models trained on the source language treebanks.1 These annotations are projected to the target language using separate word alignments for each source language, combined into a single graph for each sentence and decoded (Sagae and Lavie, 2006), resulting in a treebank for the target language, Faroese in the case of Tyers et al.’s and our experiments. Tyers et al (2018) describe a method for creating synthetic treebanks for Faroese based on previous work which uses machine translation and word alignments to transfer trees from source language(s) to the target language. It is shown that, for Faroese, a combination of the four source languages (multi-source projection) is superior to individual language projection
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.