Detecting Ad Hoc Rules for Treebank Development

Markus Dickinson

doi:10.33011/lilt.v4i.1225

Abstract

We outline a method of detecting ad hoc, or anomalous, rules in treebank grammars, by exploiting the fact that such rules do not fit with the rest of the grammar. Ad hoc rules are rules used for specific constructions in one data set and unlikely to be used again. These include ungeneralizable rules, erroneous rules, rules for ungrammatical text, and rules which are not consistent with the rest of the annotation scheme. Based on the idea that valid rules should receive support from other rules in the grammar, we develop two methods for detecting ad hoc rules in flat treebanks and show they are successful in detecting such rules. Although one can put some linguistic knowledge into determining rule similarity and dissimilarity, the methods work best by using a simple, modified Levenshtein distance. We illustrate this on the English Wall Street Journal treebank and the German TIGER treebank. For the latter, we extend the method to formalisms incorporating discontinuous constituents, employing CFG-like rules for the comparisons.

Highlights

Our starting point for comparing rules comes from a method of annotation error detection which searches for inconsistency of labeling within local trees (Dickinson and Meurers, 2005b).
We have presented work on detecting ad hoc rules in treebanks, where an ad hoc rule is an annotation error, covers an ungrammatical sentence, reveals issues with the uniformity of an annotation scheme, or is a rule that does not generalize well.
We started with a notion of equivalence classes, the idea that different rules express the same linguistic content with respect to valency, and moved on to more general similarity metrics.
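
The idea that a valid rule should receive support from similar rules can be illustrated with a small sketch. It is not the paper's implementation: the rule counts are invented toy data with Penn Treebank-style labels, the distance is plain Levenshtein over daughter categories (the paper's modification is not described in this summary), and the names `edit_distance`, `support`, and the threshold `max_dist` are assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple

# A rule is (mother category, tuple of daughter categories).
Rule = Tuple[str, Tuple[str, ...]]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Plain Levenshtein distance over category labels, not characters."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def support(rule: Rule, counts: Dict[Rule, int], max_dist: int = 1) -> int:
    """Sum the counts of other rules with the same mother whose daughters
    lie within max_dist edits of this rule's daughters (assumed threshold)."""
    mother, daughters = rule
    total = 0
    for other, count in counts.items():
        if other == rule:
            continue
        other_mother, other_daughters = other
        if other_mother == mother and edit_distance(daughters, other_daughters) <= max_dist:
            total += count
    return total


# Toy rule counts, invented for illustration only.
counts: Dict[Rule, int] = {
    ("NP", ("DT", "NN")): 500,
    ("NP", ("DT", "JJ", "NN")): 300,
    ("NP", ("DT", "NN", "PP")): 120,
    ("NP", ("VBD", "CC", "IN")): 1,
    ("VP", ("VBD", "NP")): 400,
    ("VP", ("VBD", "NP", "PP")): 200,
}

# Rules with no support from any similar rule are candidates for review.
unsupported = [rule for rule in counts if support(rule, counts) == 0]
```

On this toy data, only `NP -> VBD CC IN` ends up in `unsupported`, since every other rule has at least one same-mother neighbour one edit away. A real setting would presumably weight support and tune the threshold rather than use a hard cutoff of zero.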

Summary

Motivation

When extracting rules from treebanks, especially constituency-based treebanks employing flat structures, grammars often limit the set of rules (e.g., Charniak, 1996), due to the large number of rules (Krotov et al., 1998) and “leaky” rules that can lead to mis-analysis (Foth and Menzel, 2006). When ungrammatical or non-standard text is used, treebanks employ rules to cover it, but do not usually indicate ungrammaticality in the annotation. These rules are only to be used in certain situations, e.g., for typographical conventions such as footnotes, and pose a problem if the set of treebank rules is intended to accurately capture the grammar of a language. This is true in the case of precision grammars for grammar checking and generation (e.g., Wagner et al., 2007; Bender et al., 2004), and in applications like intelligent computer-aided language learning, where learner input is parsed to detect what is correct or not (e.g., Metcalf and Boyd, 2006; Dickinson and Lee, 2009). Here we highlight the theoretical issues involved across different scenarios and in many parts provide more extensive evaluation.

A starting point: valency in flat treebanks

Background

Basic valency inconsistencies

Ad hoc detection with equivalence classes

Ad hoc detection without equivalence classes

Evaluation of different methods

Evaluation of basic valency inconsistencies

Evaluation of ad hoc detection with equivalence classes

Evaluation of ad hoc detection without equivalence classes

Results on test data

Discontinuous constituents

An appropriate representation

Evaluation for discontinuous constituents

Related work

Conclusions

Journal: Linguistic Issues in Language Technology
Publication Date: Apr 1, 2011
Citations: 4
License type: cc-by