LinGO Redwoods

Stephan Oepen, Dan Flickinger, Christopher D. Manning, Kristina Toutanova

doi:10.1007/s11168-004-7430-4

Abstract

The LinGO Redwoods initiative is a seed activity in the design and development of a new type of treebank. A treebank is a (typically hand-built) collection of natural language utterances and associated linguistic analyses; typical treebanks, such as the widely recognized Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993), the Prague Dependency Treebank (Hajic, 1998), or the German TiGer Corpus (Skut, Krenn, Brants, & Uszkoreit, 1997), assign syntactic phrase structure or tectogrammatical dependency trees over sentences taken from a naturally occurring source, often newspaper text. Applications of existing treebanks fall into two broad categories: (i) use of an annotated corpus in empirical linguistics as a source of structured language data and distributional patterns and (ii) use of the treebank for the acquisition (e.g. using stochastic or machine learning approaches) and evaluation of parsing systems. While several medium- to large-scale treebanks exist for English (and some for other major languages), all pre-existing publicly available resources exhibit the following limitations: (i) the depth of linguistic information recorded in these treebanks is comparatively shallow, (ii) the design and format of linguistic representation in the treebank hard-wires a small, predefined range of ways in which information can be extracted from the treebank, and (iii) representations in existing treebanks are static and, over the (often year- or decade-long) evolution of a large-scale treebank, tend to fall behind theoretical advances in formal linguistics and grammatical representation. LinGO Redwoods aims at the development of a novel treebanking methodology, (i) rich in nature and dynamic in both (ii) the ways linguistic data can be retrieved from the treebank in varying granularity and (iii) the constant evolution and regular updating of the treebank itself, synchronized to the development of ideas in syntactic theory.
Starting in October 2001, the project is aiming to build the foundations for this new type of treebank, develop a basic set of tools required for treebank construction and maintenance, and construct an initial set of 10,000 annotated trees to be distributed together with the tools under an open-source license. Building a large-scale treebank, disseminating it, and positioning the corpus as a widely accepted resource is a multi-year effort; the results of this seeding activity will serve as a proof of concept for the novel approach that is expected to enable the LinGO group at CSLI both to disseminate the approach to the wider academic and industrial audience and to secure appropriate funding for the realization and exploitation of a larger treebank. The purpose of publication at this early stage is three-fold: (i) to encourage feedback on the Redwoods approach from a broader academic audience, (ii) to facilitate exchange with related work at other sites, and (iii) to invite additional collaborators to contribute to the construction of the Redwoods treebank or start its exploitation as early-access versions become available. This paper is an updated version of an earlier project report published by Oepen, Callahan, Flickinger, and Manning (2002); changes over that version include more recent numbers on the current Redwoods development status, inclusion of an example of discriminator-based disambiguation, and minor adaptations and corrections in various parts of the discussion.

Journal: Research on Language and Computation
Publication Date: Dec 1, 2004
Citations: 115
