Abstract
We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work’s interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.7 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).
Highlights
Dependency parsing is one of the core natural language processing tasks
Grammar induction induces an explicit grammar, such as a probabilistic context-free grammar (PCFG), from an unparsed corpus, and uses that grammar to parse sentences of the language
We showed how to build a “language-agnostic” delexicalized dependency parser that can better parse sentences of an unknown language by exploiting an unparsed corpus of that language
Summary
Dependency parsing is one of the core natural language processing tasks. It aims to parse a given sentence into its dependency tree: a directed graph of labeled syntactic relations between words. Grammar induction induces an explicit grammar, such as a probabilistic context-free grammar (PCFG), from an unparsed corpus, and uses that grammar to parse sentences of the language. This approach has encountered two major difficulties.

Our approach is instead inspired by Wang and Eisner (2017), who use an unparsed but tagged corpus to predict the fine-grained syntactic typology of a language: for example, they may predict that about 70% of the direct objects fall to the right of the verb. Their system is trained on a large number of (unparsed corpus, true typology) pairs, each representing a different language. The basic idea of the present paper is that instead of predicting interpretable typological properties of a language as Wang and Eisner (2017) did, we predict a language-specific version of the scoring function that a parser uses to choose among various actions or substructures.
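To make the grammar-induction baseline concrete, here is a toy sketch (not from the paper) of what "using an explicit grammar to parse" means: a PCFG over POS tags assigns each parse tree a probability, namely the product of its rule probabilities. All rules and probabilities below are invented for illustration.

```python
# Toy illustration (hypothetical): a PCFG over POS tags and the
# probability it assigns to one parse tree. Grammar induction would
# try to learn such rule probabilities from an unparsed corpus.

# Rules: (left-hand side, right-hand side) -> probability.
PCFG = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("DET", "NOUN")): 0.6,
    ("NP", ("NOUN",)): 0.4,
    ("VP", ("VERB", "NP")): 0.7,
    ("VP", ("VERB",)): 0.3,
}

def tree_probability(tree):
    """Probability of a parse tree: the product of its rule probabilities.

    A tree is (label, child, ...) where each child is either a subtree
    or a terminal POS tag (a plain string).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = PCFG[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_probability(child)
    return prob

# P(S -> NP VP) * P(NP -> NOUN) * P(VP -> VERB NP) * P(NP -> DET NOUN)
tree = ("S", ("NP", "NOUN"), ("VP", "VERB", ("NP", "DET", "NOUN")))
print(tree_probability(tree))  # 1.0 * 0.4 * 0.7 * 0.6 = 0.168
```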
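The core idea of predicting a language-specific scoring function can likewise be sketched in code. The following is a minimal, hypothetical illustration, not the authors' actual architecture: the module names, dimensions, and the GRU/MLP choices are our own assumptions. A learned extractor summarizes an unparsed corpus of POS-tag sequences into a language vector, and an arc scorer conditions on that vector when scoring candidate head-dependent attachments; training end-to-end on many other (corpus, treebank) language pairs is what would teach the extractor to produce useful features.

```python
import torch
import torch.nn as nn

N_TAGS, TAG_DIM, LANG_DIM = 17, 32, 64  # e.g., 17 universal POS tags

class LanguageFeatureExtractor(nn.Module):
    """Summarizes an unparsed corpus of POS sequences into one vector."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_TAGS, TAG_DIM)
        self.gru = nn.GRU(TAG_DIM, LANG_DIM, batch_first=True)

    def forward(self, corpus):  # corpus: list of 1-D LongTensors of tag ids
        # Encode each sentence with the GRU, then mean-pool the final
        # hidden states over the whole corpus.
        finals = [self.gru(self.embed(s).unsqueeze(0))[1][0, 0] for s in corpus]
        return torch.stack(finals).mean(dim=0)  # shape: (LANG_DIM,)

class ArcScorer(nn.Module):
    """Scores a candidate head->dependent arc, conditioned on the language."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_TAGS, TAG_DIM)
        self.mlp = nn.Sequential(
            nn.Linear(2 * TAG_DIM + LANG_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, head_tag, dep_tag, lang_vec):
        x = torch.cat([self.embed(head_tag), self.embed(dep_tag), lang_vec])
        return self.mlp(x)  # higher score = more plausible attachment

# Usage: features from the target language's unparsed corpus steer the
# scorer, so the same trained parser behaves differently per language.
extractor, scorer = LanguageFeatureExtractor(), ArcScorer()
corpus = [torch.tensor([3, 7, 1]), torch.tensor([3, 7, 4, 1])]  # toy tag ids
lang_vec = extractor(corpus)
score = scorer(torch.tensor(7), torch.tensor(1), lang_vec)
```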