Abstract
We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work’s interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.7 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).
Highlights
Dependency parsing is one of the core natural language processing tasks
Grammar induction induces an explicit grammar, such as a probabilistic context-free grammar (PCFG), from an unparsed corpus, and uses that grammar to parse sentences of the language
We showed how to build a “language-agnostic” delexicalized dependency parser that can better parse sentences of an unknown language by exploiting an unparsed corpus of that language
Summary
Dependency parsing is one of the core natural language processing tasks. It aims to parse a given sentence into its dependency tree: a directed graph of labeled syntactic relations between words. Grammar induction induces an explicit grammar, such as a probabilistic context-free grammar (PCFG), from an unparsed corpus, and uses that grammar to parse sentences of the language. This approach has encountered two major difficulties.

Our approach is instead inspired by Wang and Eisner (2017), who use an unparsed but tagged corpus to predict the fine-grained syntactic typology of a language: for example, they may predict that about 70% of the direct objects fall to the right of the verb. Their system is trained on a large number of (unparsed corpus, true typology) pairs, each representing a different language. The basic idea of the present paper is that instead of predicting interpretable typological properties of a language as Wang and Eisner (2017) did, we predict a language-specific version of the scoring function that a parser uses to choose among various actions or substructures.
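To make the grammar-induction baseline concrete, here is a toy sketch (not from the paper) of what "using an explicit grammar to parse" means: a PCFG over POS tags assigns each parse tree a probability, namely the product of its rule probabilities. All rules and probabilities below are invented for illustration.

```python
# Toy illustration (hypothetical): a PCFG over POS tags and the
# probability it assigns to one parse tree. Grammar induction would
# try to learn such rule probabilities from an unparsed corpus.

# Rules: (left-hand side, right-hand side) -> probability.
PCFG = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("DET", "NOUN")): 0.6,
    ("NP", ("NOUN",)): 0.4,
    ("VP", ("VERB", "NP")): 0.7,
    ("VP", ("VERB",)): 0.3,
}

def tree_probability(tree):
    """Probability of a parse tree: the product of its rule probabilities.

    A tree is (label, child, ...) where each child is either a subtree
    or a terminal POS tag (a plain string).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = PCFG[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_probability(child)
    return prob

# P(S -> NP VP) * P(NP -> NOUN) * P(VP -> VERB NP) * P(NP -> DET NOUN)
tree = ("S", ("NP", "NOUN"), ("VP", "VERB", ("NP", "DET", "NOUN")))
print(tree_probability(tree))  # 1.0 * 0.4 * 0.7 * 0.6 = 0.168
```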
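The core idea of predicting a language-specific scoring function can likewise be sketched in code. The following is a minimal, hypothetical illustration, not the authors' actual architecture: the module names, dimensions, and the GRU/MLP choices are our own assumptions. A learned extractor summarizes an unparsed corpus of POS-tag sequences into a language vector, and an arc scorer conditions on that vector when scoring candidate head-dependent attachments; training end-to-end on many other (corpus, treebank) language pairs is what would teach the extractor to produce useful features.

```python
import torch
import torch.nn as nn

N_TAGS, TAG_DIM, LANG_DIM = 17, 32, 64  # e.g., 17 universal POS tags

class LanguageFeatureExtractor(nn.Module):
    """Summarizes an unparsed corpus of POS sequences into one vector."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_TAGS, TAG_DIM)
        self.gru = nn.GRU(TAG_DIM, LANG_DIM, batch_first=True)

    def forward(self, corpus):  # corpus: list of 1-D LongTensors of tag ids
        # Encode each sentence with the GRU, then mean-pool the final
        # hidden states over the whole corpus.
        finals = [self.gru(self.embed(s).unsqueeze(0))[1][0, 0] for s in corpus]
        return torch.stack(finals).mean(dim=0)  # shape: (LANG_DIM,)

class ArcScorer(nn.Module):
    """Scores a candidate head->dependent arc, conditioned on the language."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_TAGS, TAG_DIM)
        self.mlp = nn.Sequential(
            nn.Linear(2 * TAG_DIM + LANG_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, head_tag, dep_tag, lang_vec):
        x = torch.cat([self.embed(head_tag), self.embed(dep_tag), lang_vec])
        return self.mlp(x)  # higher score = more plausible attachment

# Usage: features from the target language's unparsed corpus steer the
# scorer, so the same trained parser behaves differently per language.
extractor, scorer = LanguageFeatureExtractor(), ArcScorer()
corpus = [torch.tensor([3, 7, 1]), torch.tensor([3, 7, 4, 1])]  # toy tag ids
lang_vec = extractor(corpus)
score = scorer(torch.tensor(7), torch.tensor(1), lang_vec)
```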