Abstract

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work’s interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.7 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).
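As a rough illustration of the pipeline the abstract describes, the following minimal sketch (assumed, not the authors' code) shows the shape of the idea: a feature extractor reads an unparsed corpus of POS sequences and produces a vector that conditions the parser's scoring function. The names featurize_corpus and score_arc are hypothetical, and simple POS-bigram proportions stand in for the learned neural feature extractor.

    # A minimal sketch, not the authors' code: featurize_corpus and score_arc
    # are hypothetical names, and POS-bigram proportions stand in for the
    # learned neural feature extractor described in the abstract.
    from collections import Counter
    from typing import List

    TAGS = ["VERB", "NOUN", "ADJ", "ADP"]  # toy tag set for illustration

    def featurize_corpus(tag_sequences: List[List[str]]) -> List[float]:
        """Map an unparsed corpus of gold POS sequences to a fixed-length
        feature vector (here: bigram proportions over the toy tag set)."""
        counts = Counter()
        total = 0
        for tags in tag_sequences:
            for a, b in zip(tags, tags[1:]):
                counts[(a, b)] += 1
                total += 1
        return [counts[(a, b)] / max(total, 1) for a in TAGS for b in TAGS]

    def score_arc(head_tag: str, child_tag: str, direction: str,
                  corpus_features: List[float]) -> float:
        """Language-specific scoring: the full system conditions a neural
        parser on the corpus features; this only shows the conditioning."""
        base = 1.0 if (head_tag, child_tag) == ("VERB", "NOUN") else 0.0
        verb_noun = corpus_features[TAGS.index("VERB") * len(TAGS) + TAGS.index("NOUN")]
        return base + (verb_noun if direction == "right" else -verb_noun)

    corpus = [["NOUN", "VERB", "NOUN"], ["NOUN", "VERB", "ADJ", "NOUN"]]
    features = featurize_corpus(corpus)
    print(score_arc("VERB", "NOUN", "right", features))  # beats the "left" score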

Highlights

  • Dependency parsing is one of the core natural language processing tasks

  • Grammar induction induces an explicit grammar, such as a probabilistic context-free grammar (PCFG), from an unparsed corpus and uses it to parse sentences of the language

  • We showed how to build a “language-agnostic” delexicalized dependency parser that can better parse sentences of an unknown language by exploiting an unparsed corpus of that language

Summary

Introduction

Dependency parsing is one of the core natural language processing tasks. It aims to parse a given sentence into its dependency tree: a directed graph of labeled syntactic relations between words. Grammar induction induces an explicit grammar, such as a probabilistic context-free grammar (PCFG), from the unparsed corpus and uses that grammar to parse sentences of the language. This approach has encountered two major difficulties. Our approach is instead inspired by Wang and Eisner (2017), who use an unparsed but tagged corpus to predict the fine-grained syntactic typology of a language: for example, they may predict that about 70% of the direct objects fall to the right of the verb. Their system is trained on a large number of (unparsed corpus, true typology) pairs, each representing a different language. The basic idea is that instead of predicting interpretable typological properties of a language as Wang and Eisner (2017) did, we will predict a language-specific version of the scoring function that a parser uses to choose among various actions or substructures.
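To make the typological property concrete, here is a small sketch, assumed rather than taken from Wang and Eisner (2017), of how the direct-object directionality statistic could be measured on a parsed corpus. The (head_index, relation) token layout and the Universal Dependencies label "obj" are illustrative choices; statistics like this one, computed from parsed corpora, serve as the true typology in the (unparsed corpus, true typology) training pairs.

    # A small sketch (assumed, not from the paper) of measuring the statistic
    # on a parsed corpus: each token is (0-based head position, relation),
    # with "obj" marking a direct object as in Universal Dependencies.
    from typing import List, Tuple

    Sentence = List[Tuple[int, str]]

    def fraction_obj_right_of_verb(parsed_corpus: List[Sentence]) -> float:
        """Fraction of direct objects that appear to the right of their head."""
        right, total = 0, 0
        for sentence in parsed_corpus:
            for position, (head, relation) in enumerate(sentence):
                if relation == "obj":
                    total += 1
                    if position > head:  # the object follows its (verbal) head
                        right += 1
        return right / total if total else 0.0

    # Toy corpus: one verb-object ordered clause and one object-verb clause.
    toy = [
        [(1, "nsubj"), (1, "root"), (1, "obj")],  # object at index 2, head at 1
        [(2, "nsubj"), (2, "obj"), (2, "root")],  # object at index 1, head at 2
    ]
    print(fraction_obj_right_of_verb(toy))  # 0.5 on this toy corpus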

Unsupervised Parsing with Supervised Tuning
Why Synthetic Training Languages?
Task Formulation
Per-Language Learning
Multi-Language Learning
Exploiting Parallel Data
Situating Our Work
The Typology Component
The Parsing Architecture
Training Objective
Training Algorithm
Basic Setup
Comparison Among Architectures
[Figure: results by size of unparsed corpus (32, 64, 128, 256, 512, 1024)]
Comparison to SST
Selected Hyperparameter Settings
Performance on Noisy Tag Sequences
Analysis by Dependency Relation Type
Final Evaluation on Test Data
Findings
Conclusion and Future Work